Introduction

Non-coding sequences make up about 80% of the 120-Mb euchromatic part of the Drosophila melanogaster genome (Adams et al. 2000). Several studies comparing D. melanogaster and its close relative D. simulans have shown that non-coding DNA is more highly constrained on average than synonymous sites (Bergman and Kreitman 2001; Halligan et al. 2004; Andolfatto 2005; Haddrill et al. 2005; Marais et al. 2005; Halligan and Keightley 2006; Casillas et al. 2007; Haddrill et al. 2008), suggesting that much non-protein-coding DNA in Drosophila is functional. Similar studies on mammalian genomes suggest that a smaller fraction of non-coding DNA is functional (Birney et al. 2007). One of the more surprising findings in Drosophila has been a significant negative correlation between intron length and intron divergence between D. melanogaster and D. simulans (Parsch 2003; Marais et al. 2005; Haddrill et al. 2005). These findings were confirmed by a whole-genome study of D. melanogaster and D. simulans, which showed that the level of selective constraint is positively correlated with intronic as well as intergenic sequence length (Halligan and Keightley 2006). Another observation is that first introns generally have a higher frequency of conserved regulatory elements in D. melanogaster (Duret 2001) and in mammals (Majewski and Ott 2002), although mammalian and Drosophila introns may be evolving differently. First introns are also longer than other introns in D. melanogaster (Duret 2001; Marais et al. 2005; Bradnam and Korf 2008). The relationship between intron divergence and intron length may thus be affected by the position of introns, although Haddrill et al. (2005) found that mean divergence did not differ between first and non-first introns within short- and long-intron size categories. Levels and patterns of constraint on intergenic sequences appear to be similar to those on long introns (Bergman and Kreitman 2001; Andolfatto 2005; Halligan and Keightley 2006).

As we have begun to see that non-coding sequences are under extensive selective constraints, questions arise as to what functions they might have. Roles in pre-mRNA secondary structure (Kirby et al. 1995; Leicht et al. 1995; Chen and Stephan 2003; Rogic et al. 2008), gene regulation (Arnone and Davidson 1997; Hardison 2000; Parsch 2004) or RNA editing (Reenan 2005) have been suggested, and there is increasing experimental evidence for these (Birney et al. 2007). More recently, it has also been shown that natural selection on coding and non-coding DNA sequences is related to nucleosome organization (Warnecke et al. 2008; Babbitt and Kim 2008; Kaplan et al. 2009).

It is important to determine whether these patterns of sequence evolution apply more generally. This can be done using comparisons of species that are sufficiently closely related that their non-coding sequences can reliably be aligned (Pollard et al. 2004), but are distant enough that there is some power to detect patterns of divergence. Unfortunately, the 12 species of Drosophila that have been sequenced (Clark et al. 2007) are far from ideal for this purpose. For this reason, we have chosen to compare the close relatives D. miranda and D. pseudoobscura. The latter is one of the 12 sequenced Drosophila genomes (Richards et al. 2005), and its approximate divergence time from D. miranda is 2 My (Barrio et al. 1992), with an average divergence at 4-fold synonymous sites of 3.6% (Bachtrog and Andolfatto 2006). A high rate of chromosomal rearrangements has been found between these two species (Bartolomé and Charlesworth 2006a). Recently, long introns were shown to be less diverged than short introns between D. pseudoobscura and D. miranda (Bachtrog and Andolfatto 2006), consistent with the results for D. melanogaster and D. simulans.

Other patterns that can be explored with this comparison are as follows. It has been shown in comparisons of D. melanogaster with its relatives that genes with high levels of non-synonymous divergence have lower codon usage bias, possibly caused by selective interference from positively selected non-synonymous mutations, or because of a general reduction in selective constraints on these genes (Betancourt and Presgraves 2002; Marais et al. 2004; Bierne and Eyre-Walker 2006; Andolfatto 2007); the same has been found in D. pseudoobscura and D. miranda for a set of 91 X-linked loci (Bachtrog 2008). However, using a molecular-level evolutionary simulation, Drummond and Wilke (2008) showed that this relationship can be fully explained by selection against the toxicity of misfolded proteins induced by mistranslation. D. miranda coding sequences have been shown to be under weak selection for codon usage (Bartolomé and Charlesworth 2006b; Bachtrog 2007). Gene expression levels are highly correlated with optimal codon usage (Duret and Mouchiroud 1999), which has been interpreted as evidence for selection leading to highly expressed genes having optimal codons for more effective or accurate translation. It is therefore important to correct for the effects of gene expression on patterns of codon usage and sequence divergence. This has not always been done, although Marais et al. (2004) found that correcting for gene expression levels estimated from microarray data did not alter the negative correlation between codon usage and divergence between D. melanogaster and D. simulans. Since expression data are available for D. pseudoobscura, the miranda-pseudoobscura comparison offers an opportunity to examine the questions raised by the studies of D. melanogaster.

The aim of this article is to analyse alignments of the sequences of a set of D. miranda BAC clones and the corresponding parts of the D. pseudoobscura genome sequences, and to determine whether the relationships described above hold for these two species.

Methods

BAC libraries for D. miranda were created by Dr. Xulio Maside (University of Edinburgh) and the Children’s Hospital Oakland Research Institute (CHORI) (Bachtrog et al. 2008; Table 1), and sequenced at the Wellcome Trust Sanger Institute. We aligned these sequences to the corresponding sequences of the D. pseudoobscura genome, for the purpose of studying sequence divergence between the two species.

Table 1 Identification of D. pseudoobscura orthologs

BAC Sequencing

Fly Material

As described in Bachtrog et al. (2008), high molecular weight DNA suitable for creating BAC libraries was isolated from adult males from a D. miranda isofemale line (MSH22: Yi and Charlesworth 2000), which has been maintained in vial cultures for more than 10 years.

BAC Library

A D. miranda BAC library was produced by the CHORI in a pTARBAC6 vector. The library was tested with the following amplicons by Dr. Mark Dorris in Edinburgh, to localize a subset of BAC sequences within the D. miranda genome: 1, DdC; 2, dpp; 3, Eno; 4, Gpdh; 5, bcd; 6, Gld; 7, hb; 8, Rp49; 9, Est-5B; 10, Gapdh2; 11, swallow; 12, sesB. The sequences for these probes were provided by Dr Carolina Bartolomé, as described in Bartolomé et al. (2005). Positive colonies established by individual amplification using the above amplicons were picked onto agar stabs and sequenced at the Wellcome Trust Sanger Institute.

There is no full sequence for BAC 10, because it contains a large number of repeats; BAC 9 contains a 120-bp region that could not be sequenced because it is surrounded by long runs of G’s and C’s; and BAC 6 is made of two contigs separated by a 1225-bp gap, possibly with a high A/T content. The repeats in BAC 10 may be associated with the transposition of this region from XL to XR in D. pseudoobscura (Bartolomé and Charlesworth 2006a).

Characterization of the BAC Clones

We used 12 BAC sequences (~200 kb each) from chromosomes 2, 4, XL and XR of D. miranda (Table 1). We excluded the neo-X and neo-Y chromosome pair from our study since they are expected to have unusual evolutionary properties, and have been the subject of a recent analysis by Bachtrog et al. (2008). We located orthologous sequences by BLASTing 200 bp from every 1200 bp of the D. miranda BAC sequences against the repeat-masked D. pseudoobscura genome (release dp3 from the UCSC genome browser; Richards et al. 2005). In the masked version of the genome, repeats identified by RepeatMasker and Tandem Repeats Finder are masked. We plotted the location on dp3 contigs against the location on the BAC sequence to identify sections that were co-linear, and all contiguous BLAST hits were grouped into fragments. The homologous sequences were aligned with MAVID (Bray and Pachter 2004), and all alignments were checked by eye to identify any possibly non-homologous sections or misalignments. No problems were found with any of the alignments. Approximately 95% of D. miranda sites were alignable, which is sufficiently high to measure the rate of evolution for the vast majority of sites, so our results should not be strongly biased. We extracted the coding, intronic and intergenic alignments for 192 genes from the large BAC alignments by mapping the D. pseudoobscura annotation onto the alignments. Some genes were found to overlap with each other in two ways: either a whole gene was included in the large intron of another gene, or two genes were overlapping for most of the sequences. Analyses were done excluding these overlapping genes to avoid data duplication, and to ensure that all sequences identified as intergenic or intronic were completely non-coding. Introns and intergenic sequences were then realigned using MCALIGN2 (Wang et al. 2006), using an insertion–deletion frequency model previously defined for Drosophila intronic DNA (Keightley and Johnson 2004). Sequences were deposited in GenBank under accession numbers FJ821025–FJ821035.
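As an illustration only, the sampling of BLAST queries described above (200 bp taken from every 1200 bp of a BAC sequence) could be sketched as follows. The function name and the choice to take the first 200 bp of each 1200-bp block, dropping any incomplete final window, are assumptions made for this sketch; the text does not specify where within each block the query was taken.

```python
def query_windows(seq_len, window=200, step=1200):
    """Return 0-based, half-open (start, end) coordinates of query windows.

    Assumption for illustration: the query is the first `window` bp of each
    `step`-bp block, and a final window shorter than `window` is dropped.
    """
    return [(start, start + window)
            for start in range(0, seq_len - window + 1, step)]


def extract_queries(sequence, window=200, step=1200):
    """Slice the query subsequences out of a BAC sequence string."""
    return [sequence[s:e] for s, e in query_windows(len(sequence), window, step)]
```

Each query would then be searched against the repeat-masked genome, and the hit coordinates plotted against the window start positions to identify co-linear fragments.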

Coding Sequences

Sequences with internal stop codons in D. miranda were excluded, since these may represent sequencing or alignment errors, or genes that have lost their function in D. miranda. Coding sequences were checked against the Flybase coding sequences of D. pseudoobscura to ensure that the annotations agreed. After rejecting two genes that had internal stop codons, five incomplete coding sequences and three overlapping genes, 182 coding sequences remained in the final dataset.

For the analyses of coding sequences, we estimated the frequency of optimal codons (F op), determined using Codonw (http://codonw.sourceforge.net) with the D. pseudoobscura preferred codon table (Vicario et al. 2007), together with the d N/d S ratio (PAML: Yang 1997) and the K a/K s ratio (using the method of Comeron 1995 implemented in Gestimator: libsequence C++ library, Thornton 2003). K a and d N measure the rate of non-synonymous substitutions, and K s and d S estimate the rate of synonymous substitutions. Gene length was calculated from the start of the first codon to the end of the last codon, including introns but leaving out UTRs. To use gene expression as a covariate, we took expression data for D. pseudoobscura from the NCBI GEO database (Edgar et al. 2002; accession GSE6640). This dataset was generated by Zhang et al. (2007), who recently performed microarray experiments to investigate sex-biased expression of orthologues and species-restricted genes in Drosophila. We used the log2 transformed signal intensities after VSN normalization, which were available for 172 genes in our dataset. These data were available for five males and four females, so we calculated the weighted average of these values for each gene.
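The frequency of optimal codons is, in essence, the proportion of codons in a gene that belong to the preferred set. The sketch below is not a reimplementation of Codonw; it only illustrates the idea. The set of optimal codons passed in is hypothetical and would have to come from the preferred codon table, and the exclusion of ATG, TGG and stop codons (which have no synonymous alternatives) is an assumption of this sketch.

```python
# Codons with no synonymous alternative, plus stop codons (assumed exclusions).
NON_INFORMATIVE = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def frequency_of_optimal_codons(cds, optimal_codons):
    """Proportion of informative codons in `cds` that are in `optimal_codons`.

    `cds` is an in-frame coding sequence; any trailing partial codon is ignored.
    Codons containing characters other than A, C, G, T are skipped.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counted = [c for c in codons
               if c not in NON_INFORMATIVE and set(c) <= set("ACGT")]
    if not counted:
        return float("nan")
    return sum(c in optimal_codons for c in counted) / len(counted)
```

For example, with a toy optimal set {"CTG", "GCC"}, the sequence ATG CTG CTA GCC TAA has three informative codons, two of which are optimal.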

Introns

In order to avoid regions where constraint due to splicing mechanisms is already documented (Halligan and Keightley 2006), we removed 8 bp from the 5′-end and 30 bp from the 3′-end of introns, which correspond to the splice sites of introns. Leaving these sites in the intron sequences would weaken any correlation with length, since these always contribute the same number of bases to every intron, and so proportionally less to longer introns. We discarded 11 introns that overlapped with coding sequence from other genes, so that the final dataset comprises 406 introns. We estimated GC content, intron length and divergence between D. miranda and D. pseudoobscura using the Jukes–Cantor correction (Jukes and Cantor 1969). We split introns into three length categories: short introns between 51 and 80 bp (284 introns), long introns between 81 and 500 bp (52 introns) and very long introns over 500 bp (70 introns). This meant that we omitted 20 introns of less than 51 bp.
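The steps in this paragraph can be sketched in a few lines. The Jukes–Cantor correction converts the observed proportion of differing sites p into the distance d = −(3/4) ln(1 − 4p/3). The function names below are hypothetical, trimming is applied to alignment columns (which assumes the ends of the alignment are ungapped), and sites with gaps or ambiguous bases are simply skipped; none of these details is specified in the text.

```python
import math


def trim_splice_regions(aligned_a, aligned_b, five_prime=8, three_prime=30):
    """Drop 8 columns at the 5' end and 30 at the 3' end of an intron alignment."""
    end = len(aligned_a) - three_prime
    return aligned_a[five_prime:end], aligned_b[five_prime:end]


def p_distance(aligned_a, aligned_b):
    """Proportion of mismatches among columns where both bases are A/C/G/T."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a.upper(), aligned_b.upper())
             if a in "ACGT" and b in "ACGT"]
    if not pairs:
        return float("nan")
    return sum(a != b for a, b in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion of differences p."""
    if p >= 0.75:
        return float("inf")  # correction is undefined at saturation
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def length_category(length):
    """Length classes used in the text; returns None for introns under 51 bp."""
    if 51 <= length <= 80:
        return "short"
    if 81 <= length <= 500:
        return "long"
    if length > 500:
        return "very long"
    return None
```

At the low divergence levels observed here the correction is small: an observed difference of 3% corresponds to a corrected distance of about 3.06%.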

In order to determine intron lengths in the D. pseudoobscura genome, we also extracted the start and end positions of introns as well as their position in the gene, from the DroSpeGe database. There are 5,986 first introns and 12,961 non-first introns in this dataset.

Intergenic Sequences

We defined intergenic sequences as those regions between the ends and the starts of coding sequences. We discarded 5 intergenic sequences that overlapped with coding sequence, giving 167 intergenic sequences in the final dataset. UTRs are unannotated in the D. pseudoobscura genome, so we analysed separately the start, middle and end of intergenic sequences. Evolutionary divergences were obtained for the whole sequence, as well as for the edges and the centre of the sequence, to detect the potential effects of UTRs at the edges of these intergenic sequences. Halligan and Keightley (2006) found the average length of 5′-UTRs and 3′-UTRs in D. melanogaster to be 148- and 280-bp long, respectively, so we used these mean values to define the edges of intergenic sequences. Intergenic sequences were all longer than 80 bp, so they were split into only two length categories, long and very long, as explained in the introns section.

Analyses

Analyses were done using the R statistical package (http://www.r-project.org/). 95% confidence intervals for partial Pearson correlation coefficients were obtained by bootstrapping 1,000 times by sequence for the coding sequence and the intergenic sequence datasets. For the intron dataset, however, an ANOVA showed significant variation for intronic divergence among genes (P = 0.009), indicating that divergence values for introns within the same gene are not independent. Thus, for this dataset we bootstrapped 1,000 times by gene. Wilcoxon two-sample tests were performed to compare means between categories of sequences (e.g. long vs. short and X-linked vs. autosomal). Paired bootstrap tests were performed to test the difference in mean divergence between sequence categories and 95% confidence intervals were calculated by bootstrapping 100,000 times by gene.
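The analyses themselves were run in R; the following Python sketch only illustrates the logic of a partial correlation with a confidence interval obtained by bootstrapping over genes rather than over individual introns. It handles a single covariate, uses a simple percentile interval, and all function names are hypothetical; the published analyses controlled for several covariates at once, which would require a regression-based partial correlation instead.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_pearson(x, y, z):
    """Correlation of x and y after removing the linear effect of one covariate z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


def bootstrap_ci_by_gene(records, n_boot=1000, seed=0):
    """Percentile 95% CI for the partial correlation, resampling whole genes.

    `records` is a list of (gene_id, x, y, z) tuples, one per intron, so that
    all introns of a sampled gene enter the replicate together.
    """
    rng = random.Random(seed)
    by_gene = {}
    for gene, x, y, z in records:
        by_gene.setdefault(gene, []).append((x, y, z))
    genes = list(by_gene)
    stats = []
    for _ in range(n_boot):
        sample = [row for g in rng.choices(genes, k=len(genes))
                  for row in by_gene[g]]
        xs, ys, zs = zip(*sample)
        try:
            stats.append(partial_pearson(xs, ys, zs))
        except (ZeroDivisionError, ValueError):
            continue  # skip degenerate replicates with zero variance
    stats.sort()
    return stats[int(0.025 * len(stats))], stats[int(0.975 * len(stats)) - 1]
```

Resampling genes keeps introns from the same gene together in every replicate, which is the point of bootstrapping by gene when within-gene values are not independent.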

Results

Relationship Between Non-Coding Sequence GC Content and Divergence

Haddrill et al. (2005) found a significantly negative correlation between intron divergence and GC content, which may influence the relationship between intron length and divergence. Here, we found that intron divergence and GC content are negatively but not significantly correlated (Fig. 1a) (D. pse: Spearman r S = −0.091, P = 0.068; D. mir: Spearman r S = −0.084, P = 0.093). After accounting for intron length, the partial Pearson correlation coefficient for divergence and GC content is r = −0.051, 95% CI = [−0.151; 0.047]. Although it is not significant, the relationship is in the same direction as found by Haddrill et al. (2005), so that a small effect cannot be excluded. There is a negative but non-significant correlation between intergenic sequence divergence and GC content; the partial correlation coefficient, after accounting for intergenic sequence length, is Pearson r = −0.012, 95% CI = [−0.182; 0.159].

Fig. 1

a Plot of intron divergence (with Jukes–Cantor multiple hit correction) against GC content. Solid circles represent first introns and open circles represent non-first introns. b Plot of intron divergence (with Jukes–Cantor multiple hit correction) against intron length on a log scale. Solid circles represent first introns and open circles represent non-first introns

Relationship Between Non-Coding Sequence Length and Divergence

The first relationship that we investigated in this context was the correlation between non-coding sequence length and divergence, since Haddrill et al. (2005) and Halligan and Keightley (2006), using D. melanogaster and D. simulans, and Bachtrog and Andolfatto (2006), using D. pseudoobscura and D. miranda, all found a negative correlation between intron length and divergence. This was also observed in our data, even after correcting for GC content (Fig. 1b) (Pearson r = −0.064, bootstrap by gene 95% CI = [−0.098; −0.039]). Using the same method, but accounting for gene expression this time, the correlation coefficient is Pearson r = −0.057 (bootstrap by gene 95% CI = [−0.096; −0.037]). Gene expression is positively correlated with intron length (Spearman r S = 0.262, P = 0.002), as Marais et al. (2005) found for first introns in D. melanogaster, but there is no significant correlation with intron divergence (Spearman r S = 0.006, P = 0.909). Accounting for both GC content and gene expression, the correlation coefficient between intron length and divergence is Pearson r = −0.057, bootstrap by gene 95% CI = [−0.094; −0.034], so the negative correlation coefficient is still significant when both variables are accounted for. These correlation coefficients are lower than those for the melanogaster–simulans comparison, possibly reflecting the smaller levels of divergence in our case, with correspondingly more noise in the estimates relative to the mean divergence.

The correlation coefficient between evolutionary divergence and intergenic sequence length, after accounting for GC content, is Pearson r = 0.1140, 95% CI = [−0.010; 0.244] (Fig. 2). The correlation is not significantly different from zero; the point estimate is positive, and the confidence interval indicates that the true value could be at most slightly negative. Halligan and Keightley (2006) found a significantly negative correlation between divergence and intergenic sequence length for their genome-wide melanogaster–simulans comparison, so our result could be due entirely to the small number of intergenic sequences surveyed in this study. There is some variation in constraint levels in intergenic sequences, since divergence is smaller at the edges of intergenic sequences (mean divergence 0.027) than in their middle (mean divergence 0.031) (one-sided Wilcoxon test, P value = 0.00357). This test suggests that the edges of intergenic sequences are more strongly constrained than the middle, possibly due to the presence of promoters or UTRs. Bachtrog and Andolfatto (2006) also found high levels of constraint (~30%) between D. pseudoobscura and D. miranda in intergenic sequences, a value that is probably due to the presence of UTRs.

Fig. 2

Plot of intergenic sequence divergence with Jukes–Cantor multiple hit correction against intergenic sequence length on a log scale

Figure 3 shows the mean divergence levels of different categories of coding and non-coding sites. As expected, non-synonymous sites have a much lower divergence than all other classes (Wilcoxon tests, P ≤ 0.00265; P ≤ 2 × 10⁻⁵ from all paired bootstrap differences between the non-synonymous sites divergence and other divergences). Synonymous sites are more diverged than long and very long intergenic and non-first intronic sequences (Wilcoxon test, P ≤ 0.0172; P ≤ 2 × 10⁻⁵ from all paired bootstrap differences between the synonymous sites divergence and intron divergences), except for first introns, but the estimate of the mean divergence for first introns has large standard errors. Short first introns have a lower divergence than short non-first introns but the difference is not statistically significant (Wilcoxon test, P = 0.285). The difference in length between first and non-first introns is non-significant (Wilcoxon test, P = 0.18), with first introns being shorter than non-first introns, which contrasts with previous observations in D. melanogaster (Maroni 1994; Duret 2001; Bradnam and Korf 2008) and also with the whole D. pseudoobscura genome data, which show that first introns are significantly longer than non-first introns (Wilcoxon test, P = 1.029 × 10⁻⁸). It is thus likely to be only a sample effect in our dataset. Short introns show higher divergence than long and very long introns, significantly so for very long introns using the paired bootstrap test (long: W = 7995, P = 0.336; very long: W = 10463, P = 0.491; P = 0.018 from all paired bootstrap differences between short intron divergence and very long intron divergence). The ratio of the mean divergence for all long introns over the mean for short introns is 0.652, which is similar to 0.636, the ratio found in the melanogaster–simulans comparison (Haddrill et al. 2005). Divergence in short introns is not significantly different from synonymous sites (Wilcoxon test, P > 0.09).

Fig. 3

Barplot of mean divergence (±SE) for different classes of sites. Rates were calculated using the Jukes–Cantor multiple hit correction. Short sequences are under 80 bp, long sequences are between 80 and 500 bp and very long sequences are over 500 bp

Comparison of X-Linked Versus Autosomal Loci

Differences between X-linked and autosomal loci have been observed in previous studies, which suggest faster non-synonymous and slower synonymous site evolution for X-linked loci (Singh et al. 2005; Larracuente et al. 2008; Vicoso et al. 2008), so we looked for the same pattern in our dataset. To account for the effect of gene length and gene expression, we separated the data for X-linked and autosomal loci into two length categories (with the cutoff gene length set at the median, 1100 bp) and separately into two expression level categories (with the cutoff expression value set at the median, 9.3). There is no significant difference in the rate of synonymous or non-synonymous substitutions between X-linked and autosomal coding sequences.

Relationship Between Codon Usage (F op) and d N

A negative correlation has previously been observed between codon usage (F op: frequency of preferred codons; Ikemura 1981) and the rate of non-synonymous substitutions (measured by d N or K a) between D. melanogaster and D. simulans (Betancourt and Presgraves 2002) and between D. pseudoobscura and D. miranda (Bachtrog 2008). This suggests that genes with fast-evolving protein sequences have lower codon usage bias, indicating that selection is less effective on codon usage bias in such genes. However, gene expression may affect this relationship, because highly expressed genes are usually more conserved. This should therefore be corrected for in such an analysis. Marais et al. (2004) have indeed shown that d N and expression level are negatively correlated in a dataset of 630 orthologous sequence pairs from D. melanogaster and D. yakuba, although this correlation is weaker than that between d N and F op. Furthermore, Andolfatto (2007) suggested that the silent substitution rate should be accounted for when calculating the correlation between codon usage and the rate of non-synonymous substitutions because of the positive correlation between K a and K s (Comeron and Aguadé 1996). Our results also show a significant positive correlation between K a and K s (Spearman r S = 0.409, P = 1.55 × 10⁻⁸). Coalescence times within the species may vary between genes, e.g. because of local differences in recombination rates, with longer coalescence times associated with higher levels of within-species polymorphism for both synonymous and non-synonymous variants. Since within-species variation must contribute considerably to the observed levels of divergence between D. miranda and D. pseudoobscura, this may at least partly explain the positive correlation between K a and K s that we observe. However, this effect seems unlikely to be important for more distantly related species, where this correlation is also found (e.g. Andolfatto 2007; Drummond and Wilke 2008). Codon usage bias decreases with gene length in D. melanogaster (Duret and Mouchiroud 1999) and long proteins are expected to be disadvantageous (Moriyama and Powell 1998), so gene length is another factor to correct for in this analysis.

F op and d N are significantly negatively correlated (Fig. 4) when correcting for d S (D. pse: Pearson r = −0.415, bootstrap 95% CI = [−0.534; −0.282]; D. mir: Pearson r = −0.402, bootstrap 95% CI = [−0.538; −0.272]). After controlling for gene length, gene expression, and d S, there is still a significant negative correlation between F op and d N (D. pse: Pearson r = −0.360, bootstrap 95% CI = [−0.494; −0.199]; D. mir: Pearson r = −0.341, bootstrap 95% CI = [−0.467; −0.182]). Thus, as found in a previous study that did not correct for gene length, gene expression and d S (Bachtrog 2008), codon usage in D. pseudoobscura and D. miranda appears to decrease as the non-synonymous substitution rate increases, even after controlling for potential effects of gene length, gene expression and d S.

Fig. 4

Plot of the frequency of optimal codons (F op) for D. pseudoobscura against non-synonymous divergence (d N) on a log scale

F op and d S are significantly positively correlated after correcting for d N, gene length and gene expression (D. pse: Pearson r = 0.430, bootstrap 95% CI = [0.239; 0.563]; D. mir: Pearson r = 0.415, bootstrap 95% CI = [0.239; 0.550]). However, this correlation is probably due to the influence of base composition on d S (Bierne and Eyre-Walker 2003), because it becomes non-significant when using K a and K s instead of d N and d S (D. pse: Pearson r = 0.145, bootstrap 95% CI = [−0.029; 0.293]; D. mir: Pearson r = 0.136, bootstrap 95% CI = [−0.034; 0.285]).

Another finding is the positive correlation between F op and coding sequence length after correcting for gene expression (Pearson r = 0.171, bootstrap 95% CI = [0.034; 0.309]), which contradicts results from Duret and Mouchiroud (1999), but is in agreement with the model of selection on translational accuracy (Drummond and Wilke 2008), according to which codon usage bias should be higher in genes encoding long proteins. However, this correlation becomes smaller and non-significant after correcting for d N (Pearson r = 0.127, bootstrap 95% CI = [−0.026; 0.269]). We also found a highly significant positive correlation between gene expression and codon usage after correcting for gene length and K s (Pearson r = 0.342, 95% bootstrap CI = [0.212; 0.466]), consistent with previous results (Shields et al. 1988; Moriyama and Powell 1998; Duret and Mouchiroud 1999) and the theory that highly expressed genes experience stronger selection on translational accuracy (Moriyama and Powell 1998; Drummond and Wilke 2008).

We also found that d N and gene expression are negatively correlated after accounting for gene length, d S and GC content (Pearson r = −0.217, 95% bootstrap CI = [−0.396; −0.023]), consistent with Marais et al. (2004) and Subramanian and Kumar (2004), who found that highly expressed proteins evolve slowly in yeast and in fruit flies, respectively. d S and gene expression are also negatively correlated, after accounting for d N, gene length and GC (Pearson r = −0.293, 95% bootstrap CI = [−0.471; −0.150]). These two relationships also agree with findings of Drummond and Wilke (2008), supporting their hypothesis of selection against misfolded proteins.

Discussion

Our study sheds new light on some important aspects of sequence evolution in Drosophila, using sequence comparisons between a set of more or less randomly chosen loci for a pair of closely related species in the obscura group, D. pseudoobscura and D. miranda. These species are sufficiently distantly related that there is some power to detect patterns of sequence evolution revealed by earlier studies, predominantly of the melanogaster group, unlike comparisons between D. pseudoobscura and D. persimilis, for which whole genome sequences are available (Birney et al. 2007). They are sufficiently close that alignments of non-coding sequences can be reliably performed, making them a useful tool for our purpose, in addition to the intrinsic importance of these species for other problems in evolutionary genetics, such as Y chromosome evolution (Bachtrog et al. 2008).

There are several limitations to our study. First, the distribution of intron lengths is highly skewed, with many more short introns than long introns, which reduces the power of correlation-based tests. Second, D. miranda and D. pseudoobscura are closely related and show signs of divergence caused by different ancestral polymorphic variants having become fixed independently in the two species, rather than by fixation of new mutations (Bartolomé et al. 2005). In addition, because only one sequence from each species was used, polymorphism and divergence are confounded, which will result in overestimates of true divergence levels and underestimation of the effects of purifying selection, since this has less influence on polymorphism than divergence (Akashi 1995; Charlesworth 1994). Finally, a variable that has not been taken into account in this study, due to a lack of detailed information, is the recombination rate. This has been shown to affect GC content (e.g. the positive correlation between recombination rate and GC content in large introns and intergenic regions; Marais et al. 2003) and the efficacy of selection (Comeron et al. 2008).

Non-Coding Sequence Length and Divergence

There is a negative correlation between intron length and intron divergence between D. melanogaster and D. simulans (Haddrill et al. 2005; Halligan and Keightley 2006), and long introns were found to be less diverged than short introns in a comparison of D. pseudoobscura and D. miranda (Bachtrog and Andolfatto 2006). This study confirms this observation. Studies using polymorphism data in addition to divergence data have also shown that long intron sequences are under purifying natural selection (Andolfatto 2005; Casillas et al. 2007; Haddrill et al. 2008), so this does not simply reflect differences in mutation rates between long and short introns.

Several possible explanations for the conservation of long non-coding sequences have been suggested. First, longer sequences may contain more cis-regulatory elements (Bergman et al. 2002; Emberly et al. 2003; Sironi et al. 2005). Another complementary explanation is based on the observation of a general mutational bias in favour of deletions in Drosophila (Petrov 2002). If a sequence is functionally constrained, deletion bias will be countered by selection and the sequence will be longer than non-conserved sequences. To explain the persistence of short poorly conserved sequences, we might argue that it is harder to fix deletions in short introns, because deletions are more likely to affect adjacent coding sequence. Furthermore, the need for a minimum intron size for correct splicing (Mount et al. 1992) suggests that short introns would merely be spacers between exons. This suggests that most of the non-coding DNA in the Drosophila genus has been conserved because it has a function.

First introns are longer (Duret 2001; Bradnam and Korf 2008) and possibly contain more regulatory elements than other introns in Drosophila melanogaster (Duret 2001). This pattern is not seen in our dataset, however, since we find that first introns are shorter than non-first introns, although short first introns have a lower divergence than non-first short introns. This is probably due to our only studying a limited set of genes, because when using the whole genome of D. pseudoobscura, first introns are significantly longer than non-first introns (data not shown).

Intergenic Sequences

As expected, the putative UTRs (edges of intergenic sequences that correspond to the average UTR lengths in D. melanogaster) have smaller divergence levels than the rest of the intergenic sequences, so they are apparently under more constraint than other non-coding sequence. Several studies have found that UTRs in D. melanogaster and D. simulans are under greater selective constraint than 4-fold-synonymous sites (Bergman and Kreitman 2001; Halligan et al. 2004; Andolfatto 2005; Halligan and Keightley 2006; Haddrill et al. 2008). As Casillas et al. (2007) point out, the indiscriminate use of any type of non-coding sequence as a neutral standard is dangerous, given this evidence for differences in constraint levels among different classes. It may be better to use 4-fold-synonymous sites or non-coding sequences that are thought to be under weak constraints, such as short introns.

Coding Sequence Patterns

One of the more interesting patterns that we recover in this study is a significant negative correlation between codon usage and non-synonymous substitution rates, even when the effects of gene expression levels, gene length and synonymous divergence are taken into account. Lower codon usage in fast-evolving genes can be explained in several ways. Hill–Robertson interference (Hill and Robertson 1966) from strongly selected sites on the effects of weak selection for optimal codon usage at linked sites is frequently invoked (Betancourt and Presgraves 2002; Andolfatto 2007). This effect might be magnified in D. miranda by the evolution towards a lower overall codon usage bias (Bartolomé and Charlesworth 2006a, b; Bachtrog 2007), caused by a lower effective population size than for D. pseudoobscura. This is because selection for codon usage is becoming weaker in D. miranda, and thus strong positive selection will override the effects of selection for codon usage all the more. Bierne and Eyre-Walker (2006) claimed that Hill–Robertson effects do not greatly affect codon bias, and they argued that the only likely alternative is that the strength of selection acting upon synonymous mutations is correlated with that acting upon non-synonymous mutations, presumably because of selection on translational accuracy. Genes that are under greater selective constraint will evolve slowly and need to be accurately translated. Relaxed selective constraint on fast-evolving genes would then lead to weaker selection for codon usage.

The positive correlation between Fop and dS potentially could be generated by a larger contribution from ancestral divergence in regions with high rates of recombination and therefore higher codon usage bias (Marais et al. 2001). However, Andolfatto (2007) showed that the negative correlation between codon usage and synonymous divergence between D. melanogaster and D. simulans, usually interpreted as being caused by lower divergence for more highly constrained synonymous sites, disappears if the data are corrected for the correlation between non-synonymous and synonymous divergence, supporting the interference hypothesis. He also showed that the magnitude of the effect is in agreement with what is expected from the observed rate of substitution of positively selected amino acid mutations between D. melanogaster and D. simulans. Our results are in general agreement with this interpretation. Two recent studies of polymorphism and divergence in D. miranda and D. pseudoobscura suggest that a significant fraction of non-synonymous divergence has been driven by positive selection (Bachtrog 2008; Haddrill et al., in preparation), as is required under this hypothesis. Drummond and Wilke (2008) found a negative correlation between Fop and dS in Drosophila melanogaster, which they explain by selection against misfolding of proteins. This relationship is the only one in Drummond and Wilke (2008) with which our results do not agree, and the lack of a significant correlation between Fop and dS could be due to the limited number of genes in our dataset. However, the Drummond and Wilke (2008) hypothesis predicts that gene expression differences drive all the patterns that they find. Therefore, finding a negative correlation between Fop and dN even after correcting for gene expression seems to suggest either that their hypothesis does not explain all aspects of the data and that hitchhiking effects are involved, or that the gene expression dataset that we used does not capture all relevant features.
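
The logic of asking whether a correlation persists after correcting for gene expression can be illustrated with a partial correlation. The following is a minimal sketch only, not the analysis used in this study: the function names (partial_corr, simulate), the residual-based approach, the rank transformation and the simulated values are all assumptions made for illustration. Both variables are regressed on the covariates, and the residuals are then correlated.

```python
import numpy as np


def _ranks(values):
    # Ordinal ranks (0..n-1); ties are not averaged in this sketch.
    return np.argsort(np.argsort(values)).astype(float)


def partial_corr(x, y, covariates, use_ranks=True):
    """Correlate x and y after regressing both on the covariates.

    With use_ranks=True all variables are rank-transformed first, which
    gives a Spearman-type partial correlation (an illustrative choice).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    cov = np.column_stack([np.asarray(c, dtype=float) for c in covariates])
    if use_ranks:
        x, y = _ranks(x), _ranks(y)
        cov = np.column_stack([_ranks(cov[:, j]) for j in range(cov.shape[1])])
    # Design matrix with an intercept column plus the covariates.
    design = np.column_stack([np.ones(len(x)), cov])
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    return float(np.corrcoef(res_x, res_y)[0, 1])


def simulate(direct_effect, n=2000, seed=0):
    """Hypothetical data: expression raises codon bias and lowers dN.

    direct_effect adds a link between codon bias and dN that does not
    pass through expression (0.0 means expression explains everything).
    """
    rng = np.random.default_rng(seed)
    expression = rng.normal(size=n)
    noise = rng.normal(size=n)
    f_op = 2.0 * expression + noise
    d_n = -2.0 * expression + direct_effect * noise + rng.normal(size=n)
    return f_op, d_n, expression


if __name__ == "__main__":
    for effect in (0.0, -0.5):
        f_op, d_n, expression = simulate(effect)
        raw = np.corrcoef(f_op, d_n)[0, 1]
        partial = partial_corr(f_op, d_n, [expression])
        print(f"direct effect {effect}: raw r = {raw:.2f}, partial r = {partial:.2f}")
```

In the simulated case where expression alone drives both variables, the raw correlation is strongly negative but the partial correlation is close to zero; when a direct effect is added, a negative partial correlation remains. A negative partial correlation in real data is therefore compatible either with an effect that is independent of expression or with an expression measure that does not capture all relevant features.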