Introduction

Messenger RNAs (mRNAs) of viruses and species from all the three domains of life generally tend to be purine (R)-rich (Szybalski et al. 1966; Smithies et al. 1981; Bell and Forsdyke 1999; Lao and Forsdyke 2000; Paz et al. 2004). The reason for this phenomenon is not well understood. The “politeness” hypothesis (Zuckerkandl 1986) assumes that R-loading in mRNAs may hinder the formation of “forbidden” double-stranded RNA (dsRNA), which severely disrupts translation and triggers various intracellular alarms (Forsdyke 1999). During initiation of translation, i.e., the rate-limiting step of protein synthesis (van der Velden et al. 2002; Arava et al. 2003), a hairpin structure in the mRNA causes the 40S ribosomal subunit to stall, and thus translation is slowed down or even stopped (van der Velden et al. 2002; Kozak 2005). From first principles, the formation of dsRNA by (inter- and intrastrand) Watson–Crick base-pairing might be lower in R-loaded mRNAs than in mRNAs with equal amounts of purines and pyrimidines. Thermophilic prokaryotes have a higher purine content and R-tract abundance in their mRNA than mesophilic species (an R-tract is a tract of n ≥ 5 sequential purines, Rn) (Paz et al. 2004). It was hypothesized that the high R-bias of thermophile mRNA is a thermoadaptation (Lao and Forsdyke 2000; Lobry and Chessel 2003; Paz et al. 2004). Short-range purine-pyrimidine sequence composition was found to be an important characteristic of species genome organization, on the above-gene level (Paz et al. 2005).

In both thermophilic and mesophilic prokaryotes, mRNAs of ribosomal proteins (RPs) and heat shock proteins (HSPs) have significantly higher R-bias and R-tracts abundance than their species average (Paz et al. 2004). Similarly, mRNAs of prokaryotic amino-acyl tRNA synthetases (ARSs) also have higher R-bias and R-tract abundance than their species average (A. Paz, unpublished results). We want to check whether in the eukarya, as is the case in the prokarya, the mRNAs of three groups of highly expressed genes, RPs, HSPs, and ARSs, have higher R-bias and R-tract abundance than the species average. The major difference between the maturation of mRNA in eukarya compared to prokarya, i.e., the splicing of precursor mRNA in the eukarya, prevents direct extrapolations and expectations for similarity. This is because there might be a need for splicing signals in the exons (besides the cis-component sequences located in the introns needed for splicing), which can influence the R-Y composition of the exons. The central importance of the splicing process to the expression level of genes in eukarya poses additional interesting questions: Does the R-Y composition of introns of the three targeted groups differ from the species average? If indeed the R-Y composition of both exons and introns of the three targeted groups differ from the average, then how can these differences be attributed to the higher expression of the former gene groups compared to the expression of “average” genes? (One can speculate that a specific R-Y composition of pre-mRNA might enable a high rate and efficiency of splicing.)

In the current study we analyze precursor RNA (pre-mRNA) of 12 eukaryotic species, regarding R-Y (content and distribution) within exons and introns. We compared the three protein-coding gene groups, RPs, HSPs, and ARSs, with a control group of randomly selected protein-coding genes. Based on the literature (see below), the expression level of the three targeted groups of genes is higher than the expression of “average” genes of the species. Thus, we referred to the former genes as targeted highly expressed genes (THEGs). An additional feature that these genes share is their cooperation in protein synthesis and maturation.

High expression of genes that belong to the three targeted groups was reported in expressed sequence tag (EST) studies of various eukaryotes, including the unicellular yeast Saccharomyces cerevisiae and multicellular species (Herruer et al. 1987; Warner 1999; Seshaiah and Andrew 1999; Warrington et al. 2000; Hsiao et al. 2001; Yu et al. 2001; van Ruissen et al. 2002). Fifty percent of the estimated RNA-polymerase II-mediated transcription initiation events in the yeast involve RP genes (Warner 1999). In many metazoan tissue types, the expression of RPs is high (ranked within the highest third), possibly to meet the requirement for a large number of ribosomes during elevated protein synthesis. HSPs and ARSs are also involved in the process of protein synthesis and maturation. The expression levels of many HSP genes are not exclusively related to stress, e.g., high temperature. Indeed, many HSP chaperones are expressed constitutively. In Saccharomyces cerevisiae and mammals, various HSP chaperones, possibly including the nascent polypeptide-associated complex, interact with ribosomes in processing and in protecting nascent polypeptides exiting the ribosome (Fewell et al. 2001; Frydman 2001; Hartl and Hayer-Hartl 2002). Eukaryotic ARSs are involved in noncanonical (noncatalytic) but very important functions related to protein synthesis (Lee et al. 2004; Park et al. 2005). In particular, some ARSs form a complex with the translation elongation factor EF-1H complex (Bec et al. 1994; Sang Lee et al. 2002). This complex has also been shown to functionally interact with EF-1α (Negrutskii et al. 1999). The proposed role of this complex is to facilitate the delivery of the charged tRNA to the ribosome.

The Main Questions of the Study

The three main questions of the study are as follows.

  1. Is there a trend of eukaryotic mRNAs of RPs, HSPs, and ARSs to have higher than average R-bias and R-tract abundance?

  2. Do the introns of these gene groups differ from average genes in the R-Y composition? If the answers to the two questions are positive, then how are the differences in the R-Y composition attributed to the higher level of expression of the former gene groups compared to “average” genes?

  3. Are there any differences between lower and higher eukaryotes in the patterns of sequence organization, with respect to the R and Y content and homotract distribution in pre-mRNAs of THEG genes, “average” genes, or both?

We expected that THEG mRNA would have higher than average R-bias and higher than average abundance of R-tracts. These features (if they exist) might contribute to both a higher efficiency of splicing and a reduction of translation disturbances. These two roles of R-biased mRNA do not exclude the coding role of the purine-rich sequences, contributing to an increased level of charged amino acids, needed for THEG protein functions. We also had preliminary assumptions about the R-Y composition of the introns: according to Chargaff’s second parity rule (Karkas et al. 1968; Rudner et al. 1968), there is an approximate equality in the nucleotide content of single-stranded DNA (%A = %T and %C = %G). This means that there should also be equality of %R to %Y (as R = A+G and Y = T+C). Although this rule was confirmed in a long-range analysis of single-stranded DNA of many species (Prabhu 1993), the aforementioned purine bias of mRNA (Szybalski et al. 1966; Smithies et al. 1981; Bell and Forsdyke 1999; Lao and Forsdyke 2000; Paz et al. 2004) means that in short range, there are deviations from this parity rule. It is interesting to check whether within pre-mRNA sequences there is a trend for “compensation” of large deviations from Chargaff’s second parity rule within the exons via specific nucleotide content of introns. This question might be especially relevant to the introns of THEGs, if our assumption that their mRNAs are highly R-biased is indeed correct. We therefore assumed that introns are pyrimidine biased in general, and we hypothesize that this bias may be higher in genes with exceedingly purine-biased mRNA and that the contrasting R-Y biases of THEG exons and introns contribute to high expression of THEGs.

Materials and Methods

We analyzed pre-mRNAs of 12 eukaryotic species: Arabidopsis thaliana, Caenorhabditis elegans, Drosophila melanogaster, Anopheles gambiae, Apis mellifera, Ciona intestinalis, Danio rerio, Takifugu rubripes, Xenopus tropicalis, Gallus gallus, Mus musculus, and Homo sapiens. For the 11 metazoan species, the sequence and annotation data were retrieved from the Ensembl Genome Browser (http://www.ensembl.org/index.html). For the plant Arabidopsis thaliana we used gene models and annotations of TAIR (http://www.arabidopsis.org/index.jsp) and MIPS (http://www.mips.gsf.de/proj/thal).

Gene selection

The analyzed genes belong to two groups: THEGs (1041 genes) and genes used as control (1200 genes). The genes of the control group were selected randomly from the annotated databases of the 12 species. We use the abbreviation RSGs (randomly selected genes) for this group. Only intron-containing genes were selected. In each of the 12 species, the total number of analyzed RSGs was at least equal to the number of THEGs.

THEGs

The THEG group included five subgroups of genes, encoding RPs, mitochondrial RPs (mtRPs), HSPs, HSPs with a prokaryotic-like HSP40 signature (DNAJs), and ARSs.

RPs and mtRPs

The RP group has a larger number of annotated genes than other THEG groups. The number of predicted ribosomal proteins is approximately 80 (although some species might have <80 RPs); for the mitochondrial RPs the predicted number is smaller. When more than one annotated gene encoding for a certain RP was found in the database for a species, we arbitrarily chose only one of these genes.

HSPs

This group was classified into six protein families on the basis of molecular mass. These included (1) high molecular weight HSPs (100 kDa); (2) the HSP90 family; (3) the highly conserved HSP70 family, which represents the most prominent eukaryotic group of HSPs; (4) the HSP60 family of chloroplasts and mitochondria; (5) eukaryotic homologues of E. coli DnaJ (HSP40); and (6) the family of small HSPs, members of which are expressed predominantly in plants (Caplan et al. 1993).

ARSs

This is the smallest group; in each species there are only 21 enzymes that belong to this group. All the annotated genes of HSPs and ARSs within the species data were included because of the relatively small size of these groups compared to the RP group. If several alternative transcripts were presented in the database, we arbitrarily chose only the first suggested annotation found in the database site.

For the analysis of the nucleotide content and sequence organization of pre-mRNA, including purine and pyrimidine tracts, we used our own C++ program (ART-5). The statistical calculations were conducted using the STATISTICA 6 program.
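The two basic quantities used throughout the analysis, purine percentage and the share of nucleotides lying in pure R- or Y-tracts of length n ≥ 5, can be illustrated with a minimal Python sketch. This is not the ART-5 program, whose source is not reproduced here; the function names are hypothetical, and taking all nucleotides of the sequence as the denominator for the tract percentage is an assumption.

```python
import re

PURINES = "AG"
PYRIMIDINES = "CTU"  # T for DNA-style sequences, U for RNA-style sequences


def purine_percent(seq):
    """Percentage of purines among the unambiguous bases (A, G, C, T/U) of seq."""
    seq = seq.upper()
    r = sum(seq.count(b) for b in PURINES)
    y = sum(seq.count(b) for b in PYRIMIDINES)
    total = r + y
    return 100.0 * r / total if total else 0.0


def tract_percent(seq, bases, min_len=5):
    """Percentage of nucleotides lying in uninterrupted runs of `bases`
    that are at least `min_len` long (assumed denominator: len(seq))."""
    seq = seq.upper()
    pattern = re.compile("[%s]{%d,}" % (bases, min_len))
    in_tracts = sum(len(m.group()) for m in pattern.finditer(seq))
    return 100.0 * in_tracts / len(seq) if seq else 0.0
```

For the toy sequence `AAGGAGCTCTTTCCCAGA`, `purine_percent` gives 50%, `tract_percent(seq, PURINES)` counts only the six-base run `AAGGAG` (the trailing `AGA` is shorter than five bases), and `tract_percent(seq, PYRIMIDINES)` counts the nine-base run `CTCTTTCCC`.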

Results and Discussion

The main results of our analysis can be concisely summarized as follows. (1) THEG mRNAs have significantly higher purine bias and higher percentages of purines in R-tracts than the control group of genes (Fig. 1). THEG mRNAs also have lower percentages of pyrimidines in Y-tracts than the control (a Y-tract is a tract of n ≥ 5 sequential pyrimidines, Yn). (2) The level of purine bias in mRNA of eukaryotic species does not seem to be correlated with the optimum growth temperature (OGT) of the species (for poikilotherms) or the body temperature (for homeothermic species). (3) Purine bias of “average” mRNAs in mammals is the lowest among the tested eukaryotes (it is reflected in high proportions of pyrimidines in tracts within exons). (4) In species in which the splicing of (many or most) pre-mRNAs is based on “intron definition” (the invertebrates and A. thaliana), THEG introns display a higher pyrimidine bias than the species average. This trend weakens toward the vertebrates and disappears, or even becomes replaced by an opposite trend, in species with a high rate of alternative splicing (mammals and chickens). (5) In all examined species, the introns of THEGs have a lower abundance of R-tracts compared to the control.

Fig. 1. Distribution of purine content in exons of the targeted groups of highly expressed genes (THEGs) and randomly selected genes (RSGs; control). R-percentage, purine percentage; No. of obs, number of observed exons with the indicated percentage of purines. Black columns: exons of ribosomal proteins, heat shock proteins, and aminoacyl tRNA synthetases (THEG group). Gray columns: exons of RSGs.

THEG Subgroups Have a Higher Exonic Purine Content, Higher Abundance of Pure R-Tracts, and Lower Abundance of Pure Y-Tracts, Compared to Control Groups

We confirmed the already known R-bias of mRNA (Szybalski et al. 1966; Smithies et al. 1981; Bell and Forsdyke 1999; Lao and Forsdyke 2000; Paz et al. 2004). This trend was found in each of the THEG subgroups as well as the RSG groups (Table 1, column 2). In accordance with our expectation (see Introduction), the average purine content in mRNA of THEG genes was found to be higher than that in RSGs (see Table 1, columns 2 and 3, for detailed data and Table 2, columns 3 and 4, for summarized data). As shown in Table 1 (columns 2 and 3), this trend was found in 40 of 43 cases (93%) of comparisons between the THEG subgroups and the RSG average. For the overall trend, the deviation from H0 of no trend is significant using the Mann-Whitney U test at p < 10⁻⁶, albeit this difference was significant in only 21 of the individual cases (49%). The small number of annotated THEGs of Ciona intestinalis, Takifugu rubripes, Xenopus tropicalis, and Gallus gallus seems to be the cause of the nonsignificant difference between these genes and the control group. mRNAs of only two THEG subgroups, the RPs of D. melanogaster and A. thaliana, display significantly lower purine bias than those of the control group (see below). As also shown in Table 1 (columns 4 and 5), in 42 of 43 cases (98%), the mRNAs of THEG subgroups have higher percentages of purines within tracts compared to the control (p < 10⁻⁶). In individual comparisons, this difference was significant at p < 0.05 in 35 of 43 cases. Columns 6 and 7 in Table 1 show that, compared to the control group, the mRNAs of THEG subgroups tend to have lower percentages of pyrimidines within Y-tracts (significance of the overall test, p < 10⁻⁶); in the individual cases this difference is significant (p < 0.05) for only 16 of 43 subgroups. Correspondingly, the ratio (% nucleotides in R-tracts/% nucleotides in Y-tracts) is higher in THEG exons compared to RSG exons (control) in all the species examined, except D. melanogaster (Fig. 2).
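The group comparisons above rely on the Mann-Whitney U test, which the study ran in STATISTICA 6. As an illustration of what the test computes, the following is a minimal pure-Python sketch. The function name is hypothetical, and the two-sided p-value uses the normal approximation without tie or continuity correction, which is a simplification; a statistics package may report slightly different values, especially for small samples.

```python
import math


def mann_whitney_u(x, y):
    """Return (U statistic of x, two-sided p) using the normal approximation.
    Assumption: no tie or continuity correction is applied."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        # Find the block of tied values starting at position i.
        while j < len(combined) and combined[j][0] == combined[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        i = j
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u_x - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2.0))
    return u_x, p_two_sided
```

Here `x` and `y` would be, for example, the purine percentages of the mRNAs of one THEG subgroup and of the RSGs of the same species.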

Table 1. Purine content and R- and Y-tract contents of THEG and RSG mRNAs
Table 2. Comparison of purine content of the mRNAs of THEG genes and RSGs of Arabidopsis, invertebrates, and chordate species
Fig. 2. R- and Y-tract composition of pre-mRNAs of THEG and RSG groups of 12 examined eukaryotic species. THEG, the three targeted highly expressed genes; RSG, randomly selected genes (control group). Black and hatched columns: the ratio of the percentage of nucleotides in R-tracts to the percentage of nucleotides in Y-tracts in THEG and RSG exons, respectively. Gray and striped columns: the ratio of the percentage of nucleotides in Y-tracts to the percentage of nucleotides in R-tracts in THEG and RSG introns, respectively.

The mRNAs of D. melanogaster and A. thaliana RPs display significantly lower purine bias than those of the control group. For both species, this is mainly because of the relatively high pyrimidine content of the 5′ untranslated regions (UTRs), whereas the coding regions still have a significantly higher purine content than the control group (data not shown). The excess of pyrimidines in Y-tracts within the D. melanogaster RPs mRNAs is also significantly higher than that of the control (Table 1, underlined). The 5′ UTRs of D. melanogaster RPs mRNAs include a pyrimidine tract with a length of 6–13 bases (not shown). These terminal oligopyrimidine tracts (TOPs) are involved in regulation of translation of RPs (Meyuhas 2000; Marygold et al. 2005). Mammals and other vertebrates also have 5′ TOP within the UTR of the RPs (Meyuhas 2000; Perry 2005). According to our results, within the vertebrates, the R-biases of both the coding and the noncoding regions of the RPs mRNAs override the small local pyrimidine bias caused by the 5′ TOP, resulting in an overall purine bias larger than that of the RSG mRNA.

The relatively low R content of A. thaliana RPs mRNAs is mainly displayed as an excess of C over G in the 5′ UTR. These results are in accordance with previous reports on GC-skew (excess of C over G) within the A. thaliana genes near the transcription start sites penetrating into the first exons that might be a transcription signal. It is larger in highly expressed genes compared to genes with low expression (Tatarinova et al. 2003; Fujimori et al. 2005).

Species Ecological Conditions and Purine Content in mRNA

As shown in Table 2 (columns 3 and 4), purine content in mRNA is not correlated with the species OGT of the poikilothermic species or the body temperature of the homeothermic vertebrates. However, it is noteworthy that the bee A. mellifera displays higher purine content within its mRNA compared to the three other analyzed invertebrates. The difference is significant (p < 10⁻⁶ using the Mann-Whitney U test) for both the THEG and the RSG groups. The temperature inside the natural hives of A. mellifera is high (32°–36°C [Tautz et al. 2003]) and sometimes can reach 40°C. These temperatures are much higher than the OGTs of C. elegans, D. melanogaster, and A. gambiae (20°, 22°, and 26°C, respectively). We speculate that the difference between the bee and the other three invertebrates in the preferable temperature of their surroundings may be the reason for the significantly higher purine content of the bee mRNAs. Other stresses could also be involved, e.g., the greater susceptibility of social insects to infections, as suggested by one of the reviewers of the manuscript.

The HSPs and DNAJs of A. thaliana have a higher R content and a higher percentage of purines in tracts compared to the control (Table 1, columns 2–5). A. thaliana OGT is ∼20°C, but the surrounding temperature occasionally reaches 40°C (Hong and Vierling 2000). Therefore, there might be a special need for adaptation of the chaperones to high temperatures (on both the mRNA and the protein levels). The high R and R-tract contents of HSP and DNAJ mRNAs might enable better stability of these mRNAs at high temperatures, hinder the formation of dsRNA (a risk that is elevated at high temperatures), promote (indirectly) the stability of the encoded proteins, and enhance their chaperone function (see Paz et al. 2004).

Patterns of Purine and Pyrimidine Composition in Introns

In Table 3 (column 2) we summarize the results of testing our assumption that introns are pyrimidine-biased. The purine content of introns is (on average) less than 50%. This was found in all 12 species, in all THEG subgroups, and in the control groups. Moreover, in the plant and invertebrates, THEG introns generally have lower percentages of purines compared to RSGs (control). This trend weakens toward vertebrate species and disappears, or even becomes replaced by an opposite trend, in chicken and mammals: the introns of the chicken, mouse, and human genes (of both THEG and RSG groups) have the lowest ratio of the percentages of nucleotides in Y-tracts to the percentages of nucleotides in R-tracts compared to all other examined species (Fig. 2). As shown in Table 3 (columns 2 and 3) and summarized in Table 4 (columns 2–4), the average purine content of THEG introns is lower than that of the control group in the tested plant, invertebrate, fish, and frog genes (in total, in 29 of 30 cases). For the overall trend, the deviation from H0 of no trend is significant (p < 10⁻⁶ using the Mann-Whitney U test).

Table 3. Comparison of purine content and R- and Y-tract contents of the introns of THEGs and RSGs of 12 eukaryotes
Table 4. Comparison of purine content of the introns of THEGs and RSGs of Arabidopsis, invertebrates, and chordate species

Our analysis shows that THEG introns have lower percentages of purines in R-tracts compared to the control genes. In particular, the average percentage of purines in R-tracts in THEG introns was lower than that of the control group in 39 of 42 cases (p < 10⁻⁶); at the individual level of analysis the difference was significant at p < 0.05 in 14 of 42 cases (see Table 3, columns 4 and 5).

Although we did not find a general trend of “compensation” for high deviations from Chargaff’s second parity rule, it seems that compensation may exist in species that have relatively large exons and in which the ratio of the average exon length to the average intron length is between ∼0.5 and 2 (Fig. 3).

Fig. 3. Purine percentage of the introns of pre-mRNA as a function of the purine percentage of exons and the average exon length. Species included are A. thaliana, C. elegans, D. melanogaster, A. gambiae, A. mellifera, C. intestinalis, D. rerio, T. rubripes, and X. tropicalis. Rex, purine percentage of mRNA; ExAvL, average length of the exons of the mRNA (number of nucleotides in the mRNA divided by number of exons); RInt, purine percentage of the introns of the pre-mRNA.
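The three variables defined in the legend of Fig. 3 can be computed per gene from a pre-mRNA sequence and its exon coordinates. The sketch below is illustrative only: the function name, the dictionary keys, and the coordinate convention (sorted, non-overlapping, 0-based half-open intervals) are assumptions, not the procedure used to produce the figure.

```python
def exon_intron_stats(pre_mrna, exons):
    """Compute Rex, ExAvL, and RInt for one pre-mRNA.

    pre_mrna: nucleotide string of the unspliced transcript.
    exons: sorted, non-overlapping (start, end) pairs, 0-based, end exclusive
           (an assumed convention for this sketch).
    """
    def purine_pct(s):
        s = s.upper()
        return 100.0 * (s.count("A") + s.count("G")) / len(s) if s else 0.0

    # The mature mRNA is the concatenation of the exons.
    exon_seq = "".join(pre_mrna[start:end] for start, end in exons)
    # Introns are the gaps between consecutive exons.
    intron_seq = "".join(
        pre_mrna[prev_end:next_start]
        for (_, prev_end), (next_start, _) in zip(exons, exons[1:])
    )
    return {
        "Rex": purine_pct(exon_seq),          # purine percentage of the mRNA
        "ExAvL": len(exon_seq) / len(exons),  # mRNA length / number of exons
        "RInt": purine_pct(intron_seq),       # purine percentage of the introns
    }
```

For a toy transcript `AAGGAGGTCTTTCTAGGAAC` with exons `[(0, 6), (16, 20)]`, the exons are purine-rich and the single intron is pyrimidine-rich, which is the kind of contrast the “compensation” question asks about.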

Purine Bias in “Average” mRNAs in Mammals Is the Lowest Among the Tested Species

As shown in Table 2, column 4, the purine content of mouse and human mRNAs of “average” genes (control group) is significantly lower than that of all of the other 10 eukaryotes examined. The difference is highly significant (p < 0.00005 for mouse, and p < 10⁻⁶ for human). Moreover, the lower R content of the human RSGs compared to those of the mouse is also significant (p < 0.005). The relatively low R-bias of the average genes in mammals is reflected in high percentages of pyrimidines in tracts within the mammalian RSG exons compared to the other eukaryotes (p < 0.003 for the mouse and p < 0.0006 for the human), although no difference between mouse and human was found in this respect (p > 0.7). We suggest that the revealed difference between mammals and other species may be related to the elevated levels of splicing control and alternative splicing in mammals. As shown in Fig. 2, the exons of mouse and human RSGs have the lowest ratio of the percentages of nucleotides in R-tracts to the percentages of nucleotides in Y-tracts compared to all other examined species.

The Purine Content of THEG mRNA Seems to Reflect the Major Trends in the Evolution of Splicing

The purine content of THEG mRNAs of A. thaliana and invertebrates proved lower than that of the nonmammalian chordate THEG genes (p < 10⁻⁶ by Mann-Whitney U test for the two comparisons; see Table 2, column 2). We suggest that the difference in purine content is related to the major use of “exon definition” in the splicing process (Robberson et al. 1990; Berget 1995) within the chordates, compared to a very low or moderate use of “exon definition” within A. thaliana and invertebrates. What seems to be a contradiction to our last statement is the finding that the purine content of mammalian THEG mRNA is not significantly different from that of the invertebrates (p > 0.5) but is significantly lower than that of the other chordate species (p < 10⁻⁶). The lower purine content of the mammalian THEG mRNAs is reflected in higher percentages of pyrimidines organized as pure Y-tracts (p < 10⁻⁶) compared to other chordates. We suggest that this trend may reflect the elevated levels of splicing control (including alternative splicing) within the mammals. As discussed in the literature, the pyrimidine tracts in the exons are the target sequences of the splicing control system.

We showed previously (Paz et al. 2004) that within prokaryotic thermophiles, the high R-bias and elevated levels of R-tracts of the mRNA are an important evolutionary adaptation to life at high temperatures, and that in the mRNAs of the highly expressed genes, RPs, HSPs, and ARSs, these features are even more pronounced than in average genes. In this study, we showed that in eukarya the R-Y composition and organization of the pre-mRNAs of RPs, HSPs, and ARSs (of both exons and introns) are significantly different from those of the average genes. These differences can be attributed to the high level of expression of the three targeted gene groups relative to the control group. The specific R-Y composition of THEG mRNAs enables lower levels of mRNA secondary structures to be formed (structures that slow down the translation rate), higher frequencies of sequences that are targets for proteins that enhance splicing, and lower frequencies of sequences that are targets for proteins that suppress splicing. The differences in R-Y composition of the introns of THEGs and average genes can be attributed also to differences in the control of splicing; the introns of THEG genes indeed show lower levels of R-tracts (which might suppress constitutive splicing of the introns). The two major trends in the evolution of splicing, the switch from high use of “intron definition” in the invertebrates to “exon definition” in the vertebrates and the higher use of alternative splicing within the mammals, have influenced the R-Y compositional design of all pre-mRNAs, in general, and the pre-mRNAs of the highly expressed genes, in particular. We suggest that the switch from high use of “intron definition” (in A. thaliana and invertebrates) to splicing based on “exon definition” (in the vertebrates) is the reason for the elevated levels of purines within the vertebrate THEG mRNAs compared to the former species.

In the “second wave” of splicing evolution, the higher eukaryotes evolved elevated rates of alternative splicing. Control and repression of exon splicing might be correlated with an elevated abundance of pyrimidine tracts within exons. Pyrimidine tracts are target sequences for pyrimidine tract binding protein (PTB), whose binding has been proposed to be the major contributor to exon splicing silencing (Wagner and Garcia-Blanco 2001). We showed that in mammals, compared to other vertebrates, there is a reduction of purine content and an increased level of pyrimidines in tracts (in pre-mRNAs of both THEG and control groups). This might be related to the need for more flexible splicing control and higher levels of alternative splicing in mammals.

Possible Functions for the Increased Content of Purines and Purine Tracts Within the Exons of THEG Genes

The most obvious explanation for the observed pattern is that THEG proteins need a high abundance of charged amino acids within their sequences compared to “average” proteins. Lysine and glutamic acid are encoded by pure purinic codons. The two other charged amino acids, aspartic acid and arginine, are also encoded by relatively purine-rich triplets (12 of the 18 possible nucleotides of the six arginine codons and 4 of 6 nucleotides of the two codons for aspartic acid). This consideration of the coding potential of R-rich and R-tract-rich mRNAs cannot rule out additional roles for this R-bias. Thus, compositional organization of exon sequences might be adapted to a high efficiency of both splicing and translation. The possibility of the coevolution of codon usage and splicing enhancers was suggested earlier (Schaal and Maniatis 1999). We propose two additional roles for the observed purine bias: (1) high purine bias of mRNAs can minimize the possibility of formation of dsRNA and/or secondary structures, and (2) purine tracts can be used as exonic splicing enhancers (ESEs).
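The codon counts quoted above can be checked directly against the standard genetic code. The short Python sketch below does so; the dictionary and function names are hypothetical and serve only this illustration.

```python
# Standard genetic code codons (RNA alphabet) for the four charged amino acids
# discussed in the text.
CHARGED_CODONS = {
    "Lys": ["AAA", "AAG"],
    "Glu": ["GAA", "GAG"],
    "Asp": ["GAU", "GAC"],
    "Arg": ["CGU", "CGC", "CGA", "CGG", "AGA", "AGG"],
}


def purine_count(codons):
    """Return (number of purine positions, total positions) over a codon list."""
    purines = sum(base in "AG" for codon in codons for base in codon)
    return purines, 3 * len(codons)


for amino_acid, codons in CHARGED_CODONS.items():
    purines, total = purine_count(codons)
    print(f"{amino_acid}: {purines} of {total} codon positions are purines")
```

Lysine and glutamic acid come out as 6 of 6, aspartic acid as 4 of 6, and arginine as 12 of 18, in agreement with the figures given in the text.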

There are a few reasons for the above proposal (2) for R-bias. In vertebrates, splicing is based on “exon definition” (Robberson et al. 1990; Berget 1995; Romfo et al. 2000; Collins and Penny 2006). Presumably, high purine bias and the abundance of pure-purine tracts in the mRNA of THEGs may enhance the splicing rate and efficiency. Purine-rich tracts serve as cis-acting sequences for serine/arginine-rich (SR) proteins of the splicing machinery. Some of the well-known ESEs are purine-rich elements. In certain cases these elements are even represented by a consensus sequence (GAR)n (Liu et al. 1998; Tacke and Manley 1999; Schaal and Maniatis 1999; Caputi and Zahler 2001; Black 2003; Webb et al. 2005). These sequences are bound by the SR proteins ASF/SF2 and Tra2 of the splicing machinery. It should be mentioned that enhancement of the splicing efficiency is not restricted to alternatively spliced pre-mRNA: ESEs might also promote constitutive splicing, and SR proteins are involved in both alternative and constitutive splicing (Bourgeois et al. 2004; Cazalla et al. 2005; Ibrahim et al. 2005). Thus, abundance of the purine tracts might enable an increased rate and efficiency of splicing in these highly expressed genes compared to genes with lower expression.

The Meaning of the Lower Content of Pyrimidine Tracts Within Exons of THEG Genes Compared to the Control Group

The simplest explanation is that the observed pattern resulted mainly from the coding requirements and that the sequences of THEG proteins include fewer hydrophobic amino acids (that are encoded by pyrimidine-biased codons) than average proteins. We suggest that the relatively low frequency of pyrimidine tracts within THEG mRNAs serves two additional roles.

Reducing the formation of secondary structures

This might result from interaction between pyrimidine tracts and the highly abundant purine tracts within THEG mRNA (G.U or A.C non–Watson–Crick pairs can be formed in some circumstances [Meroueh and Chow 1999]). Thus, even if the bases in one tract are not perfectly complementary to the bases in the other tract, undesirable secondary structures might be formed between poly(R) and poly(Y) tracts.

Lowering the rate of THEG exon splicing suppression

In vertebrates, splicing is based mainly on exon definition (Robberson et al. 1990; Berget 1995; Collins and Penny 2006). PTB binding to pyrimidine tracts in the exon has been suggested to be the main cause of exon splicing silencing (Wagner and Garcia-Blanco 2001), and several models that explain the silencing mechanisms have been proposed (Wagner and Garcia-Blanco 2001; Oberstrass et al. 2005; Amir-Ahmady et al. 2005; Ibrahim et al. 2005). It seems reasonable that the frequency of Y-tracts will be lower in the exons of highly expressed genes than in the exons of genes that need more control of expression at the splicing level.

Possible Functions for High Pyrimidine Bias and Low Abundance of Purine Tracts Within the Introns of THEG Genes in the Plant and Invertebrates

The pyrimidine bias of introns is especially pronounced in THEG genes and coincides with a lower percentage of purine in tracts. We suggest that this composition of THEG gene introns is an adaptation to high expression, in two of the following (not mutually exclusive) aspects.

(1) The foregoing specific intron composition might lower the chance of formation of RNA secondary structures by undesirable bonds between the intronic polypyrimidine tract located near the 3′ splice site, an essential cis-acting component of splicing (Ruskin and Green 1985; Roscigno et al. 1993; Coolidge et al. 1997), and the polypurine tract. This should be more important in species where splicing of (many or most) pre-mRNAs is based on “intron definition” (in invertebrates and A. thaliana, respectively).

(2) Many ESEs, the target sequences for the binding of SR proteins in exons, are purine-rich (Black 2003; Webb et al. 2005). There are reports that SR proteins can suppress splicing when bound to sequences located within introns (Kanopka et al. 1996; Dauksaite and Akusjarvi 2002; Ibrahim et al. 2005). In addition, sequences that are the binding targets for SR proteins are ESEs when located in exons, but when inserted to introns, they can cause inactivation of a 3′ splice site located downstream (Ibrahim et al. 2005). Therefore Maniatis and coworkers (Ibrahim et al. 2005) proposed that in constitutively spliced introns, SR protein binding sites should be rare. It is reasonable that in species with lower rates of alternative splicing, the abundance of sequences that are used as ESEs will be low in introns, especially in the introns of THEG genes.

The chicken and mammals have the lowest pyrimidine bias within THEG introns compared to all the other species examined. The elevated frequency of purines in the introns of THEG genes of these higher eukaryotes might enable higher rates of alternative splicing, by the inclusion (in certain cases) of parts of the introns in the mRNA. As the composition of all mRNAs is, on the average, R-biased, and THEG genes have an even higher R-bias than the average, intronic sequences with higher purine content have (in general) a higher probability of becoming a part of mature mRNAs, especially in THEG genes. For such events to occur, it is also required that other cis-components for the splicing machinery will be properly organized.

Final Remarks and Conclusions

We suggest that the R-Y composition of pre-mRNAs has influenced the evolution of splicing and, vice versa, was affected by the pressures resulting from evolutionary changes in the splicing machinery. As mRNAs tend to be, on the average, purine-biased, distinguishing between exonic and intronic sequences might be easier if intronic sequences become pyrimidine-biased. And indeed, the polypyrimidine tract located downstream from the branch point, near the 3′ splice site, is an essential cis-component of splicing. The pyrimidine bias of the introns, which might enable a high efficiency of splicing, is conserved in species with a high use of “intron definition” in the splicing process (A. thaliana and invertebrates). In these species, the introns of highly expressed THEG genes seem to be “superintrons,” better adapted to proper recognition by the splicing machinery components without unnecessary disturbances. Recently, Andolfatto (2005) adopted the McDonald-Kreitman test (1991) for noncoding sequences and estimated that positive selection affected ∼20% of the nucleotides in introns. If the constrained intronic nucleotides (estimated relative to fourfold synonymous sites) are also considered, then more than 50% of the nucleotides in the introns are considered to be functionally relevant.

The switch from high use of “intron definition” in the splicing process in invertebrates to “exon definition” in the vertebrates (due to the increasing intron length) had influenced the R-Y compositional design of all pre-mRNAs, in general, and the pre-mRNAs of the highly expressed genes, in particular. The mRNAs of highly expressed genes should have a higher abundance of ESEs and a lower level of exonic splicing silencers, despite the constraints of the coding role. It seems very reasonable that purine-biased sequences were chosen to be cis-acting ESEs (binding targets of the SR proteins), due to their high abundance in the mRNAs of RPs, HSPs, and ARSs (proteins that are all highly expressed due to their key role in the cell metabolism). There could also have been a need to lower the amount of target sequences for the binding of SR proteins in the introns, and this trend might be even more important for THEG genes. We believe that the pre-mRNA R-Y composition of both exons and introns, especially the distribution of homotracts, is an important layer of gene organization, which strongly influences the expression level of eukaryotic genes.