Introduction

In some organisms, selection can act on gene sequences to improve either speed or fidelity of translation (translational selection) and this selection pressure is usually stronger for highly expressed genes. The classic example is codon usage bias, where certain codons are used in genes expressed at a high level, and these codons usually correspond to the most abundant tRNA molecules. Population genetic theory predicts this effect of selection will be stronger in species with a large effective population size. Hence, a strong correlation between expression and codon bias is observed in bacteria (Gouy and Gautier 1982; Grantham et al. 1981; Ikemura 1981a, b), yeast (Bennetzen and Hall 1982; Ikemura 1982), flies, and nematodes (Duret and Mouchiroud 1999).

The expression level of a gene can be correlated with a number of parameters of gene structure. Intron length, for example, is negatively correlated with expression in both H. sapiens and C. elegans (Castillo-Davis et al. 2002). Comeron (2004) investigated human gene structure and composition. Positive correlations between intron density (number of introns per kilobase of coding sequence) and expression level and expression breadth (number of tissues in which the gene is expressed) were reported. Considering that the length of individual introns is under selection in highly expressed genes, it is expected that the number of introns is also under selection. Both would reduce the cost and time required to transcribe a highly expressed gene. The positive correlation between intron density and expression in humans is then counterintuitive. This suggests that other factors are working to increase the density of introns found in highly expressed genes in the human genome. This result has not been checked in more compact genomes such as the fly or worm.

In addition to the cost of transcribing a DNA sequence, removal of introns from the pre-mRNA also incurs an energy and time cost to the cell. Before the introns can be removed they must be recognized by the splicing machinery. This is mediated through recognition of the splice sites. The D. melanogaster donor splice site is characterized by the consensus sequence G|GU(A/G)AGU (Weir and Rice 2004), where | marks the exon-intron boundary. The GU dinucleotide at the start of the intron is highly conserved. The donor splice site is complementary to the 5′ end of U1 small nuclear (sn)RNA and this interaction is required for splicing. The splicing factor U2AF recognizes the acceptor splice site, characterized by the consensus UNCAG (Weir and Rice 2004). Here the last two nucleotides of the intron (AG) are highly conserved. In C. elegans the donor and acceptor splice site consensus sequences are AG|GUAAGU and UUUCAG|R (Fields 1990).

Do relationships exist among splice site signals, gene structure parameters, and gene function parameters such as expression? Splice sites flanking long introns display high information content in invertebrates. The first studies investigated a small number of introns in C. elegans (Fields 1990) and D. melanogaster (Mount et al. 1992). These suggested that strong splice sites could help the spliceosome to recognize pairs of splice sites across long distances. The observation has been repeated using the entire D. melanogaster genome (Weir and Rice 2004). The information content of each base surrounding the splice site was calculated and groups of progressively longer introns in the D. melanogaster genome were examined. As the length of the introns increased, the number of bases contributing information to the splice site increased. Recently this study was expanded (Weir et al. 2006). The cumulative or total information for a splice site was calculated and again showed an increase with intron length. Interestingly an increase in information content was also observed when the splice sites were near very short introns and exons, although the effect was subtle with regard to short introns. To the authors, this suggested crowding problems for RNA binding molecules and that a stronger interaction of the spliceosome is required in order for proper splicing to occur. Nonflanking intron and exon lengths were also investigated and were shown to have an effect on the strength of a splice site. This suggests that perhaps blocks of splice sites might interact with the spliceosome machinery.

Splice site strength varies between sites flanking constitutive and alternatively spliced introns and exons in humans (Zheng et al. 2005). Constitutive splice sites have a stronger signal compared to those used in alternative splicing. A link with expression was also investigated. Cases where there is one donor splice site and a choice between an acceptor site (AA alternative acceptor) or vice versa (AD alternative donor) were examined. Those used most frequently according to EST data showed stronger signals. The stronger splice sites are preferred in >50% of the cases.

In this paper, we examine the relationship between intron density and expression in both C. elegans and D. melanogaster. We also investigate the relationships between splice site strength and intron length and intron density in the two species. Finally, we examine the effects of alternative splicing on these relationships.

Methods

Sequences

Sequence data were taken from the NCBI FTP site (ftp.ncbi.nlm.nih.gov). GenBank files were downloaded for all chromosomes for C. elegans and D. melanogaster. These are annotated with intron-exon coordinates for 20,466 and 9590 pre-mRNA transcripts for the two species, respectively. Subsequently the genes were divided into those producing one transcript (single product dataset) and those producing multiple alternative transcripts (multiple product dataset). The single product dataset for C. elegans contained 17,176 transcripts, while that for D. melanogaster contained 5217. The total numbers of introns for the two species were 82,211 and 16,655, respectively.
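The step from annotated exon coordinates to introns and to the single and multiple product datasets can be sketched as follows. This is a minimal illustration, not the parsing code used in the study: the record layout `(gene_id, transcript_id, exons)` and both function names are hypothetical, and exon coordinates are assumed to be 1-based, inclusive, and on a single strand.

```python
from collections import defaultdict


def introns_from_exons(exons):
    """Return (start, end) for each intron lying between consecutive exons.

    exons: list of (start, end) pairs, 1-based and inclusive (assumed layout).
    """
    ordered = sorted(exons)
    return [
        (prev_end + 1, next_start - 1)
        for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:])
    ]


def split_by_transcript_count(transcripts):
    """Partition genes into single product and multiple product datasets.

    transcripts: iterable of (gene_id, transcript_id, exons) tuples (hypothetical).
    """
    by_gene = defaultdict(list)
    for gene_id, transcript_id, exons in transcripts:
        by_gene[gene_id].append((transcript_id, exons))
    single = {g: t for g, t in by_gene.items() if len(t) == 1}
    multiple = {g: t for g, t in by_gene.items() if len(t) > 1}
    return single, multiple


# Toy records, invented for illustration only.
toy = [
    ("geneA", "A.1", [(1, 100), (151, 300), (400, 500)]),
    ("geneB", "B.1", [(1, 200), (260, 400)]),
    ("geneB", "B.2", [(1, 200), (300, 400)]),
]
single, multiple = split_by_transcript_count(toy)
intron_lengths = [end - start + 1 for start, end in introns_from_exons(toy[0][2])]
```

Intron length then follows directly from each coordinate pair, as in the last line of the sketch.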

Splice Site Strength

The strength of a splice site was calculated following Zheng et al. (2005) using a position specific scoring-based approach. A set of 450 validated D. melanogaster splice sites and flanking sequences was taken from the Berkeley Drosophila Genome Project (www.fruitfly.org). These have been widely used for testing splice site prediction software and so provide a good reference set. Similarly the Sanger Institute provides splice site data for 8192 C. elegans introns (www.sanger.ac.uk). These test sequences were aligned at the splice site and the frequency of each base at each position was calculated. The following equation was then used to score each position of the splice site.

$$ \mathrm{score} = \sum_{i} \log \frac{F(X_i)}{F(X)} $$

\( F(X_i) \) is the frequency of nucleotide X at position i in the aligned reference set, while \( F(X) \) is the background frequency of that nucleotide. The background nucleotide composition is 60% AT for D. melanogaster and 66% AT for C. elegans. Donor splice sites were scored from position −3 to position +7 (3 exon bases, 7 intron bases) and acceptor splice sites from position −26 to position +2 (26 intron bases, 2 exon bases), following Rogan et al. (1998). The contributions of all nucleotide positions were summed to give separate acceptor and donor site scores. The two scores were also added together to give a combined score, since splice sites are recognized as pairs.
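A minimal sketch of this position-specific scoring is given below. Several details are assumptions rather than statements from the text: the natural logarithm is used, a pseudocount is added to avoid taking the log of zero, the background AT fraction is split equally between A and T (and the GC fraction between G and C), and sequences are written in the DNA alphabet. The function names and the toy reference sites are hypothetical.

```python
import math
from collections import Counter


def position_frequencies(aligned_sites, pseudocount=1.0):
    """Per-position base frequencies from equal-length aligned reference sites.

    The pseudocount is an assumption added here so that unseen bases do not
    produce log(0); the text does not say how zero counts were handled.
    """
    length = len(aligned_sites[0])
    total = len(aligned_sites) + 4 * pseudocount
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in aligned_sites)
        freqs.append({b: (counts.get(b, 0) + pseudocount) / total for b in "ACGT"})
    return freqs


def background_from_at(at_fraction):
    """Background frequencies, assuming A = T and C = G (an assumption)."""
    gc_fraction = 1.0 - at_fraction
    return {"A": at_fraction / 2, "T": at_fraction / 2,
            "C": gc_fraction / 2, "G": gc_fraction / 2}


def splice_site_score(site, freqs, background):
    """Sum over positions of log(F(X_i) / F(X)) for the bases in `site`."""
    return sum(math.log(freqs[i][base] / background[base])
               for i, base in enumerate(site))


# Toy donor-like reference set (invented), with the D. melanogaster background.
reference = ["GTAAGT", "GTAAGT", "GTGAGT", "GTAAGA"]
freqs = position_frequencies(reference)
background = background_from_at(0.60)
strong = splice_site_score("GTAAGT", freqs, background)
weak = splice_site_score("CCCCCC", freqs, background)
```

A combined score for an intron would then be the sum of its donor and acceptor scores, each computed against its own frequency table.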

Gene Expression

Gene expression levels were estimated in two ways: EST counts and codon usage bias. The relative abundance of ESTs in a sequence database can be used to estimate expression levels (Bortoluzzi and Danieli 1999). EST databases were taken from the NCBI FTP site; 273,150 EST sequences were available for D. melanogaster and 273,860 for C. elegans. These were compared to coding sequences using BLASTN (Altschul et al. 1997). High-scoring pairs (HSPs) of >400 nucleotides with >95% identity were accepted as matches. If identity exceeded 98%, HSPs of 100–400 nucleotides were also accepted. The correlation between gene expression levels from microarray data sets and estimates from EST counts was examined by Munoz et al. (2004). They suggest that sequence length should be accounted for when EST data are used and recommend dividing all EST counts by sequence length. One reason is that longer mRNA sequences take more time to decay and so are overrepresented in the EST databases. When this correction is applied, the correlation between microarray data and EST measures increases. Sequences lacking EST hits were excluded from all analyses using EST counts as a measure of expression.
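The acceptance thresholds and the length correction can be written as two small functions. This is a sketch only: the function names are hypothetical, and using the coding sequence length as the divisor is an assumption, since the text says only that EST counts are divided by sequence length.

```python
def hsp_is_match(length_nt, percent_identity):
    """Apply the two acceptance rules for a BLASTN high-scoring pair."""
    if length_nt > 400 and percent_identity > 95.0:
        return True
    if 100 <= length_nt <= 400 and percent_identity > 98.0:
        return True
    return False


def length_normalised_est_level(est_hits, sequence_length_nt):
    """EST count per nucleotide, or None when the gene has no EST hits.

    Returning None mirrors the exclusion of genes without EST hits.
    """
    if est_hits == 0:
        return None
    return est_hits / sequence_length_nt


# Hypothetical HSPs given as (length in nt, percent identity).
hsps = [(450, 96.0), (450, 95.0), (200, 99.0), (200, 97.0), (50, 100.0)]
accepted = [hsp for hsp in hsps if hsp_is_match(*hsp)]
level = length_normalised_est_level(len(accepted), 2000)
```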

We also used the Codon Adaptation Index (CAI; Sharp and Li 1987) as a measure of codon usage bias, which is an accepted predictor of expression levels in both species (Duret and Mouchiroud 1999). We used the CAI program of the EMBOSS package (Rice et al. 2000). The CAI program comes with a reference set of highly expressed genes from a range of species, which are used to generate a table of codon frequencies preferentially used by highly expressed genes. The codon usage of each gene is then compared to this reference set. CAI scores range from 0 to 1.0, where a high CAI score indicates a similar codon usage pattern to the reference set of highly expressed genes. A low CAI score indicates a gene is expressed at a low level.
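For readers unfamiliar with the index, the sketch below illustrates the CAI calculation as defined by Sharp and Li (1987): each codon receives a relative adaptiveness equal to its count in the reference set divided by the count of the most frequent synonymous codon, and the CAI of a gene is the geometric mean of these values over its codons. This is not the EMBOSS implementation. The reference counts are invented, the pseudocount added to every codon is an assumption, and Met, Trp, and stop codons are excluded because they have no synonymous alternatives to choose between.

```python
import math
from collections import defaultdict

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def relative_adaptiveness(reference_counts, pseudocount=0.5):
    """w = codon count / count of the most used synonymous codon.

    The pseudocount (an assumption) keeps unseen codons from giving w = 0.
    """
    by_amino_acid = defaultdict(list)
    for codon, amino_acid in CODON_TABLE.items():
        if amino_acid not in "*MW":
            by_amino_acid[amino_acid].append(codon)
    w = {}
    for codons in by_amino_acid.values():
        counts = {c: reference_counts.get(c, 0) + pseudocount for c in codons}
        top = max(counts.values())
        for codon, count in counts.items():
            w[codon] = count / top
    return w


def cai(cds, w):
    """Geometric mean of w over the scorable codons of a coding sequence."""
    usable = len(cds) - len(cds) % 3
    logs = [math.log(w[cds[i:i + 3]])
            for i in range(0, usable, 3) if cds[i:i + 3] in w]
    return math.exp(sum(logs) / len(logs)) if logs else float("nan")


# Invented reference counts for two amino acids (Ala and Lys).
w = relative_adaptiveness({"GCT": 90, "GCC": 10, "AAA": 20, "AAG": 80})
preferred = cai("ATGGCTAAG", w)   # uses only the most frequent codons
avoided = cai("ATGGCCAAA", w)     # uses rarer synonymous codons
```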

Results

Intron Density Versus Expression Level

C. elegans

A weak negative relationship between the expression of a gene and the length of its coding sequence was previously reported in H. sapiens, using SAGE data, by Urrutia and Hurst (2003). A similar analysis was carried out by Comeron (2004) based on microarray data. Using Spearman correlation coefficients we observe the same effect in C. elegans (Table 1). Expression levels measured using EST data show a stronger negative correlation with coding sequence length compared to using CAI. We also observe a positive correlation between intron number and CDS length (Table 1), as expected. When investigating the relationship between intron number and expression this must be accounted for. We therefore used intron density (number of introns per kilobase of coding sequence). We found a negative correlation between intron density and expression, using both EST and CAI data (Table 1). This is opposite to the effect observed in H. sapiens by Comeron (2004).
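Intron density and the rank correlation used here can be illustrated with a short, dependency-free sketch. The function names and the toy gene values are hypothetical. Spearman's coefficient is computed as the Pearson correlation of average ranks, which handles ties; significance testing is not shown.

```python
import math


def intron_density(n_introns, cds_length_nt):
    """Number of introns per kilobase of coding sequence."""
    return n_introns / (cds_length_nt / 1000.0)


def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Toy genes (invented): (number of introns, CDS length in nt, expression).
genes = [(2, 2000, 0.80), (5, 2000, 0.55), (6, 1500, 0.40), (9, 1800, 0.30)]
densities = [intron_density(n, length) for n, length, _ in genes]
expression = [e for _, _, e in genes]
rho = spearman(densities, expression)
```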

Table 1 Correlations between gene expression levels and gene parameters

The Spearman rank correlation coefficients in Table 1 were calculated using all genes. In Fig. 1 the data are subdivided into three groups of equal size based on gene expression level. This shows that highly expressed genes tend toward reduced intron density. Subdividing the data into three groups of equal size based on coding sequence length also showed significant negative correlations between expression and intron density within these groups (data not shown; correlation coefficients ranged from −0.180 to −0.269; p < 0.001 for each).
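The grouping behind Fig. 1 can be sketched as follows, assuming each gene is represented as an `(expression, intron_density)` pair. The function names and the toy values are hypothetical, and when the number of genes is not divisible by three the leftover genes are placed in the lowest groups, which is an arbitrary choice made for this sketch.

```python
import math
import statistics


def equal_sized_groups(records, key, n_groups=3):
    """Sort records by `key` and cut them into n_groups of near-equal size."""
    ordered = sorted(records, key=key)
    size, extra = divmod(len(ordered), n_groups)
    groups, start = [], 0
    for g in range(n_groups):
        end = start + size + (1 if g < extra else 0)
        groups.append(ordered[start:end])
        start = end
    return groups


def mean_and_sem(values):
    """Mean and standard error of the mean (sample standard deviation / sqrt(n))."""
    mean = statistics.mean(values)
    if len(values) < 2:
        return mean, float("nan")
    return mean, statistics.stdev(values) / math.sqrt(len(values))


# Toy (expression, intron density) pairs, invented for illustration.
genes = [(0.2, 5.0), (0.3, 4.5), (0.4, 4.0), (0.5, 3.5),
         (0.6, 3.0), (0.7, 2.5), (0.8, 2.0)]
low, medium, high = equal_sized_groups(genes, key=lambda g: g[0])
summary = [mean_and_sem([density for _, density in group])
           for group in (low, medium, high)]
```

The same helper applies when genes are grouped by coding sequence length or introns by intron length: only the `key` changes.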

Fig. 1

Expression and intron density. Expression plotted against intron density (number of introns per kilobase coding sequence). A C. elegans: data are divided into three equal-sized groups (N = 5725) based on expression. Only results for CAI are shown. Bars represent the standard error of the mean. B D. melanogaster: details as for C. elegans

D. melanogaster

In D. melanogaster the relationships follow the same trend as in C. elegans when CAI is used as the measure of expression (Table 1). A negative correlation exists between coding sequence length and expression level. Intron number correlates positively with coding sequence length. As observed in C. elegans, intron density also correlates negatively with expression. Subdividing the data into three equal-sized bins according to sequence length gave the same trends (data not shown: R ranges from −0.278 [p < 0.001] for the shortest sequences to −0.360 [p < 0.001] for the longest). Using EST data, however, the correlation between intron density and expression level is positive, although close to zero (0.057).

Splice Site Strength Versus Intron Length

We measured splice site strength in a fixed window around each splice site, as done by Zheng et al. (2005). Acceptor and donor splice site strengths were examined separately and then summed to give total splice site strength. Figure 2 shows that all three measures increase with intron length in both D. melanogaster and C. elegans. Spearman rank correlation coefficients were calculated for all genes in D. melanogaster (acceptor score R = 0.110, p < 0.001; donor score R = 0.164, p < 0.001; combined score R = 0.183, p < 0.001) and C. elegans (acceptor score R = 0.110, p < 0.001; donor score R = 0.164, p < 0.001; combined score R = 0.183, p < 0.001).

Fig. 2

Intron length and splice site strength. A Intron length versus splice site quality for C. elegans. Acceptor and donor scores are plotted separately (left panel) and then combined (right panel). Data are divided into six equal-sized groups based on intron length. Intron length: group 1, <57 nt; group 2, 58–60 nt; group 3, 61–66 nt; group 4, 67–100 nt; group 5, 101–363 nt; group 6, >363 nt. B Intron length versus splice site quality for D. melanogaster: details as above. Intron length: group 1, <46 nt; group 2, 47–50 nt; group 3, 51–62 nt; group 4, 63–180 nt; group 5, 181–466 nt; group 6, >467 nt

Intron Density Versus Splice Site Strength

The information content surrounding a splice site is also weakly dependent on the lengths of nonflanking introns and exons (Weir et al. 2006). Here we investigate the relationship between intron density and splice site strength. It is already known that constitutive splice sites have stronger splicing signals than alternative ones (Zheng et al. 2005). Therefore, we initially only looked at genes annotated with a single transcript (single product dataset). The data were binned by intron length, as splice site strength increases with intron length, as already shown in Fig. 2. Then each intron length group was divided into groups of increasing expression. Correlations were calculated within these groups and the results are shown graphically in Figs. 3 and 4.

Fig. 3

C. elegans: splice site strength and intron density. We separated our data into three expression groups (across) and intron length groups (down). Expression is measured using CAI. Bars represent the standard error of the mean. Spearman correlation coefficients are calculated using all genes within each group. Significant negative correlation coefficients are indicated by ***. They were observed across all short intron groups (R = −0.047, R = −0.101, R = −0.100). The other significant correlations are for medium introns–medium expression (R = −0.078), medium introns–high expression (R = −0.105), and long introns–high expression (R = −0.053). These are all significant after Bonferroni correction for multiple tests

Fig. 4

D. melanogaster: splice site strength and intron density. Details as for Fig. 3. A weak negative correlation is observed within the three short intron groups (top three panels), across all expression levels. These correlations are significant at p < 0.001, after correction for multiple tests (low expression, R = −0.086; medium expression, R = −0.057; high expression, R = −0.104). Within highly expressed, medium-length introns a significant negative correlation is also observed (R = −0.107)

A negative correlation between splice site strength and intron density is seen in short introns across all expression levels in both C. elegans (Fig. 3; top three panels) and D. melanogaster (Fig. 4; top three panels). The correlations are weak but significant, with coefficients ranging from R = −0.057 to −0.104 in D. melanogaster and from −0.047 to −0.101 in C. elegans. In C. elegans a negative correlation between intron density and splice site score is also observed within all highly expressed genes (three right-hand panels in Fig. 3), irrespective of intron length. Here the correlations range from R = −0.053 to −0.105 and are all significant, but the weakest is seen within long introns (bottom-right panel in Fig. 3). In D. melanogaster the only significant correlation outside of the top row is in the medium intron length, high expression group (middle-right panel in Fig. 4).
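Because nine panels are tested per species, each p-value is judged against a Bonferroni-corrected threshold. The sketch below shows the correction itself; the family-wise level of 0.05 and the p-values are assumptions chosen for illustration, not values from this study.

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Flag tests whose p-value passes the Bonferroni threshold alpha / m."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


# Nine hypothetical p-values, one per panel (3 intron length x 3 expression groups).
panel_p_values = [0.0004, 0.0001, 0.0002, 0.02, 0.003, 0.0005, 0.4, 0.09, 0.001]
flags = bonferroni_significant(panel_p_values)
n_significant = sum(flags)
```

With nine tests the per-test threshold is 0.05 / 9, roughly 0.0056, so a nominal p of 0.02 no longer counts as significant.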

The above correlations between splice site strength and intron density are based on our single product dataset (this will now be referred to as our initial analysis). In principle this contains constitutive splice sites only. Based on results from Zheng et al. (2005) this removes variation in splice site score due to alternative splicing. Our multiple product dataset contains a mixture of constitutive and alternative splice sites. Examining these constitutive splice sites should show a similar effect between intron density and splice site strength, as observed in our initial analysis. Here we just report the overall correlation for all genes in each of the groups in the nine panels in Figs. 3 and 4. In C. elegans (Table 2) we find two significant results from the short intron groups. The correlation coefficients are stronger compared to the equivalent panels in Fig. 3, from the initial analysis. All other groups show a correlation close to zero. In D. melanogaster (Table 3) there are three significant results that withstand a Bonferroni correction for multiple tests. These are found in small introns expressed at a medium level and medium length introns expressed at a high level. Once again the correlation coefficients are stronger than those from the initial analysis shown in Fig. 4.

Table 2 Splice quality versus intron density for C. elegans
Table 3 Splice quality versus intron density for D. melanogaster

Some of the groups that were significant in our initial analysis are now not significant. There are two explanations for these differences. First, our single product dataset is composed of cases where one gene produces one transcript. Our multiple product dataset consists of genes producing multiple transcripts. This affects the intron density of these transcripts. Comparing the intron density of the transcripts in these two groups, we observe a higher intron density within our multiple product dataset compared to our single product dataset (Fig. 5). The median intron density of alternative transcripts in D. melanogaster is 2.6 introns per kilobase of coding sequence, compared to 2.1 introns per kilobase in the single product dataset (Mann-Whitney test p < 0.001). C. elegans also showed a significant difference in intron density between single and multiple product datasets (median values: single product dataset, 4.27; multiple product dataset, 4.42; p < 0.001).
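The comparison of intron density between the two datasets relies on the Mann-Whitney test. A dependency-free sketch is shown below, using the normal approximation with a tie correction and no continuity correction; these choices are assumptions, since the text does not state which variant of the test was used. The function name and the toy densities are hypothetical.

```python
import math


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p-value from the normal approximation)."""
    pooled = sorted([(v, 0) for v in sample_a] + [(v, 1) for v in sample_b],
                    key=lambda item: item[0])
    n_a, n_b, n = len(sample_a), len(sample_b), len(pooled)
    rank_sum_a, tie_term = 0.0, 0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0          # tied values share a mean rank
        tie_size = j - i + 1
        tie_term += tie_size ** 3 - tie_size
        rank_sum_a += mean_rank * sum(1 for k in range(i, j + 1)
                                      if pooled[k][1] == 0)
        i = j + 1
    u_a = rank_sum_a - n_a * (n_a + 1) / 2.0
    variance = n_a * n_b / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return u_a, 1.0                           # all values tied
    z = (u_a - n_a * n_b / 2.0) / math.sqrt(variance)
    return u_a, math.erfc(abs(z) / math.sqrt(2))


# Toy intron densities (introns per kb), invented for illustration.
single_product = [1.8, 2.0, 2.1, 2.2, 2.3, 1.9]
multiple_product = [2.4, 2.6, 2.7, 2.5, 2.9, 2.8]
u_stat, p_value = mann_whitney_u(single_product, multiple_product)
```

For the small samples used in this illustration an exact test would be preferable; the normal approximation is better suited to the thousands of transcripts compared in the text.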

Fig. 5

Intron density of single and multiple product datasets. Distributions of intron density in A C. elegans and B D. melanogaster

The second explanation is that constitutive splice sites from the multiple product dataset are stronger than the sites from the single product dataset. We compared the strength of splice sites between our single and our multiple product datasets (Table 4). Constitutive acceptor and donor sites in our multiple product dataset have the highest splice scores, followed by those found in our single product dataset. Alternative splice sites have the weakest splice scores, as previously noted by Zheng et al. (2005).

Table 4 Variation in splice site strength across different datasets

Discussion

A number of gene parameters are correlated with gene expression levels. A positive correlation between the expression level of a gene and the intron density of that gene was reported by Comeron (2004) in H. sapiens. This was surprising given the expectation that highly expressed genes will be under selection to reduce the cost and time needed for transcription to occur. This suggests that other factors play a role in determining the density of introns found in human genes. We find the opposite effect, however, in C. elegans and D. melanogaster (Fig. 1). This supports a simpler model of selection acting to reduce intron content, both length and number, in highly expressed genes, in these organisms.

It has been shown that the strength of splice sites in D. melanogaster increases with intron length (Mount et al. 1992; Weir et al. 2006; Weir and Rice 2004). We show the expected correlation between intron length and splice site strength in both species. This is the first analysis of splice sites in the full genome of C. elegans. We found stronger sites in long introns measured from position −3 to position +7 for the 5′ or donor splice site and −26 to +2 for the 3′ or acceptor splice site (Fig. 2).

We investigated the strength of the splice signal and show a negative correlation between intron density and the strength of splice sites. It is known that constitutive splice sites are of a higher strength compared to those used in alternative transcripts (Zheng et al. 2005). We initially removed all genes annotated with more than one mRNA transcript to remove this effect. Intron length clearly has an effect as already investigated here and in previous studies in other genomes. We removed the effect of length and expression by binning the genes into groups according to intron length and gene expression level. Weaker splice sites are associated with high densities of introns. The effect is evident in short introns in both species and is strongest in those expressed at high levels. We also took all constitutive splice sites from our multiple product dataset and analyzed splice site strength and intron density.

Weir and Rice (2004) reported an increase in splice strength in longer introns. In a more recent analysis they also observed this in short introns (Weir et al. 2006), though the effect was weak. They propose that the stronger signal is due to a crowding effect and that small introns require the stronger signal for proper interaction with the spliceosome machinery. From this analysis, splice site strength also varies within short introns. When the intron is short, stronger splice sites tend to be found in less intron dense genes. Possibly, short dense introns are recognized more easily due to reduced intervening sequence and there is less of a chance for cryptic splice sites to occur. This also suggests that splice sites may not be recognized in isolation and single splice site interactions could be aided by neighboring ones. This was suggested by Weir et al. (2006) to explain nonflanking intron lengths also having an effect on splice site strength. Our analysis suggests that densely packed splice sites are recognized more easily by the spliceosome compared to those farther apart and are under less pressure to contain strong signals.

This negative correlation between splice strength and intron density is also seen in the highly expressed groups in both species. This is particularly evident in C. elegans. Intron recognition has been shown to be a more efficient mechanism of splicing than exon recognition (Fox-Walsh et al. 2005) and so is expected to be the main mechanism used in highly expressed genes. This gives a further advantage to reducing intron length. A stronger signal in short introns which are also in less dense genes could ensure faster recognition by the spliceosome and possibly allow cotranscriptional splicing, meaning that introns are removed while transcription is still ongoing. This, again, would speed up the process of producing a complete mature mRNA transcript ready for translation. In principle, the mechanism that our results suggest could also explain the unexpected positive correlation between intron density and expression in human genes. A group of high-density, short introns is recognized more efficiently and so this structure is selected for in highly expressed genes in human.