Introduction

Sequence variation in noncoding regions is a rich source of data for evolutionary and population genetic studies, and introns present a particularly useful set of such regions. Intronic sequences are presumed to evolve more rapidly than corresponding nuclear protein coding regions as they are subject to fewer functional constraints (Li 1997; Saitou and Ueda 1994; Clark et al. 1996).

During the molecular characterization of the Alcohol dehydrogenase (Adh) locus in the medfly, Ceratitis capitata, two alleles of the Adh-1 gene were identified that differed in the size of the first intron, with introns of about 2.1 and 2.6 kb respectively (Brogna et al. 1994, 2003). The medfly Adh locus consists of two tightly linked genes Adh-1 and Adh-2, the result of a gene duplication on the left arm of chromosome 2 (Malacrida et al. 1992). The products of the two genes differ in their pI, tissue localization, and developmental profiles (Gasperi et al. 1992, 1994). The interest in the medfly Adh locus stems from its potential use in genetic sexing strategies for the biological control of this important agricultural pest and also as a marker for germ-line transformation and population analysis (Robinson et al. 1986, 1988; Gomulski et al. 1998).

cDNA and genomic clones for the two medfly Adh genes have been isolated and characterized and the tissue and developmental specificity of their expression was studied in a medfly laboratory strain, Benakeion (Brogna et al. 2001). The coding region of both genes is interrupted by introns after codons 61 and 165, positions which are conserved between medfly and Drosophila. Typically for insect Adh introns, these introns are short, between 70 and 162 bp, with the exception of the first intron of Adh-1, which is longer and polymorphic in size. Using an exon-primed intron crossing PCR (EPIC-PCR) approach, Gomulski et al. (1998) revealed a surprisingly high degree of size polymorphism in the first intron of the Adh-1 gene. Eighteen distinct size variants, ranging from approximately 1400 to 3450 bp, were detected in 16 natural populations from five geographical regions in the species range: Africa, Mediterranean Basin, Latin America, Hawaii, and Australia. The medfly population structure inferred from this intron polymorphism is comparable to that deduced from the analysis of allozyme variation over 26 biochemical loci (Gomulski et al. 1998; Malacrida et al. 1998).

Based on their length, the 18 intron variants were grouped into four size categories: short (1400 bp), medium (1920–2150 bp), long (2480–2690 bp), and very long (3450 bp). In this study five representative intron size variants were sequenced in order to determine the nature of the size variation. We show that the length polymorphism is mainly due to three different insertion or deletion events, one of which involves a mariner-like transposable element. Restriction map analyses of the introns indicate that the polymorphism at this locus is much more extensive than that deduced from fragment size alone. Phylogenetic analysis suggests that one of the two shortest intron sequences may be the ancestral variant, whereas the longest variants appear to be the most recent.

Materials and Methods

Wild Material and Laboratory Strains

Sixteen natural populations were sampled throughout the geographic range of C. capitata. A total of 27 samples from these populations was obtained, as pupae, in the years 1992–1995 from the following regions: Africa (Kenya, Réunion, Morocco, Canary), Mediterranean region (Libya, Israel, Crete, Chios, Thessaloniki, Athens, Italy), Latin America (Guatemala, Argentina), Pacific (Mauna Loa, Kauai), and Australia. Four laboratory strains, Benakeion (Rina and Savakis 1991), Ispra, M12, and M66 were also screened for intron polymorphism. A detailed description of these population samples and laboratory strains was presented in Gomulski et al. (1998). The C. rosa flies were collected from Ruiru, Kenya, in November 1998.

DNA Preparation and PCR Amplification

DNA was extracted from individual flies using the method of Baruffi et al. (1995). Following treatment with RNase A, the DNA was extracted with phenol/chloroform, precipitated with ethanol and resuspended in TE (10 mM Tris–HCl, pH 8, 1 mM EDTA). The DNA concentration was quantified using a Hoefer TKO 100 fluorometer.

PCR amplification of the C. capitata introns was achieved using two primers complementary to the two Adh-1 exon sequences flanking the first intron: 5′-ggaagcttgcttatgaagcctgcaagt-3′ and 5′-actttggtttttgggttgatttcc-3′ as described by Gomulski et al. (1998). As these primers did not amplify the C. rosa intron sequences, different combinations of primers based on the C. capitata exon sequences were tested to identify a primer pair that successfully amplified the C. rosa intron sequence. The most efficient primer pair identified was 5′-gtggcttgggcttcattgctta-3′ and 5′-ggggatcctgcaaagcttttatagc-3′ using the same amplification conditions as described by Gomulski et al. (1998) but with an annealing temperature of 58°C.

Amplification of the postdoc Element

Amplification of the postdoc element was achieved by nested PCR using as template the eluted intron PCR product. Only one primer, complementary to the inner 17 bp of the element’s inverted terminal repeat (ITR) sequence, was used: 5′-taagttcgtgcccgatt-3′. Amplification conditions were as described by Gomulski et al. (1997) but with an annealing temperature of 50°C.
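The reason a single primer suffices can be illustrated with a short sketch: because the terminal repeats are inverted, the same primer sequence finds a binding site on each strand, one at each end of the element. The toy element below is hypothetical (the primer sequence is taken from the text, but the internal sequence is invented, and for illustration the primer is assumed to read as the top strand of the left-hand repeat).

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (a, c, g, t only)."""
    pairs = {"a": "t", "c": "g", "g": "c", "t": "a"}
    return "".join(pairs[base] for base in reversed(seq.lower()))


# Primer quoted in the Methods (inner 17 bp of the postdoc ITR).
ITR_PRIMER = "taagttcgtgcccgatt"


def toy_element(internal: str) -> str:
    """Hypothetical element: repeat motif + internal sequence + inverted repeat motif."""
    return ITR_PRIMER + internal.lower() + reverse_complement(ITR_PRIMER)


def primer_sites(template: str, primer: str) -> dict:
    """Locate the primer on the top strand and on the bottom strand.

    A bottom-strand site appears on the top strand as the primer's
    reverse complement. Positions are 0-based; -1 means no site found.
    """
    template = template.lower()
    return {
        "top_strand_start": template.find(primer),
        "bottom_strand_start": template.find(reverse_complement(primer)),
    }


element = toy_element("acgtacgtac")  # invented 10-bp internal sequence
sites = primer_sites(element, ITR_PRIMER)
# Expected product spans from the top-strand site to the end of the bottom-strand site.
product_length = sites["bottom_strand_start"] + len(ITR_PRIMER) - sites["top_strand_start"]
```

In this toy case the predicted product covers the whole element, which is why size differences between postdoc copies show up directly as differences in band size.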

Agarose Gel Electrophoresis

PCR products were analysed on 14 × 10-cm 1.5% agarose gel slabs in 1× TBE buffer together with a 50-bp (Amersham Pharmacia Biotech), 100-bp, or 1-kb ladder standard (Roche). The bands were visualized by ethidium bromide staining and exposure to UV light. Bands selected for further analysis were eluted from the gel using the Agarose Gel DNA Extraction Kit (Roche).

Restriction Maps

Restriction maps of the introns were constructed using restriction endonucleases AsnI, EcoRI, SalI, MspI, HaeIII, HindII, and SspI (Roche). These enzymes were selected on the basis of the presence and distribution of their recognition sites in the two available sequences obtained from the Benakeion strain (Brogna et al. 1994, 2003). The digestion was performed on DNA fragments that were eluted from agarose gels. Single, double, and partial digests were performed as required.
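Selecting enzymes by the distribution of their recognition sites in a known sequence amounts to a simple pattern search. The sketch below is an illustration only, not the procedure used in the study; it uses the standard recognition sequences of the seven enzymes and reports 0-based start positions of each recognition site (not the cut position). The test sequence is invented.

```python
import re

# Standard recognition sequences (5'->3'). HindII is degenerate: GTYRAC,
# with Y = C or T and R = A or G. All seven sites are palindromic, so
# scanning one strand is sufficient.
RECOGNITION_SITES = {
    "AsnI": "ATTAAT",
    "EcoRI": "GAATTC",
    "SalI": "GTCGAC",
    "MspI": "CCGG",
    "HaeIII": "GGCC",
    "HindII": "GT[CT][AG]AC",
    "SspI": "AATATT",
}


def find_sites(seq: str) -> dict:
    """Return {enzyme: [0-based start positions of recognition sites]}."""
    seq = seq.upper()
    found = {}
    for enzyme, pattern in RECOGNITION_SITES.items():
        # A lookahead is used so that overlapping occurrences are also reported.
        found[enzyme] = [m.start() for m in re.finditer(f"(?={pattern})", seq)]
    return found


# Invented toy sequence containing one EcoRI, one SalI, one SspI and one MspI site.
toy = "AAGAATTCTTGTCGACAAATATTCCGG"
site_map = find_sites(toy)
```

Note that every SalI site (GTCGAC) is also a HindII site (GTYRAC), so the two enzymes report the same position in the toy sequence.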

Cloning and Sequencing the Size Variants of the First Intron of the Adh-1 Gene

The Be-2060 and Be-2590 intron variants were cloned during attempts to isolate Adh genomic clones by screening a BglII genomic library of the Benakeion strain of C. capitata and using an Adh-1 cDNA clone as probe (Brogna et al. 2003). The Ke-1400B and Ke-1970 intron variants were isolated from flies collected in Kenya and the Mo-3450 variant from a sample from Morocco (Gomulski et al. 1998). The cloning of the Ke-1400B, Ke-1970, and Mo-3450 size variants was achieved by PCR methodology. PCR fragments were cloned into the vectors pGEM-T (Promega) or PCR®2.1 (Invitrogen Corporation). The four C. rosa intron variants were cloned into the PCR®2.1 vector.

DNA sequencing of both strands of the intron variants was carried out by a combination of manual reactions using Sequenase version 2.0 (United States Biochemical) and automated sequencing using an ABI 310 genetic analyzer with the ABI Prism BigDye Terminator Cycle Sequencing ready reaction kit (Applied Biosystems) using vector-specific and sequence-specific primers.

DNA Sequencing Analysis

Sequence comparisons were performed using the BLAST (version 2.2.1; 13 April 2001) family of programs from the National Center for Biotechnology Information (Altschul et al. 1990). The C. capitata and C. rosa sequences were aligned using Clustal X 1.81 (Jeanmougin et al. 1998) and the resulting alignment was adjusted manually. Maximum-likelihood phylogenetic analyses were performed using PAUP* 4.0 b10 (Swofford 1998), including only the sequences common to all the introns, i.e., excluding the postdoc, central duplication, 3 indel, and Crmar2 sequences from the analysis and using the C. rosa sequences as an outgroup. The substitution model employed was that indicated as being optimal by Modeltest 3.06 (Posada and Crandall 1998). The confidence of the groupings was estimated using 1000 bootstrap replications. The nucleotide sequences reported in the present study have been deposited in the GenBank database under accession numbers AY426618–AY426626.

Results

Amplification of the Intron Sequences

To study the size polymorphism of the first intron of the Adh-1 gene we used the EPIC-PCR technique. The primers used for amplifying the medfly intron were within the flanking sequences of exons 1 and 2 of the Adh-1 gene; thus the amplified products are 126 bp longer than the intron itself. For consistency, the introns were named according to the length of this EPIC fragment as deduced from agarose electrophoresis (Gomulski et al. 1998) prefixed by the first two letters of the geographical locality (or strain) of the sample. Haplotypes of the same intron size variant as deduced from restriction map data are given a one-letter suffix. Thus two haplotypes of 1400-bp EPIC products from Kenyan flies would be denoted Ke-1400A and Ke-1400B.
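The naming convention can be expressed as a small helper. This is a sketch for illustration; the function names are hypothetical, and the accent stripping is an assumption made so that "Réunion" yields the "Re" prefix used in the text.

```python
import unicodedata

# From the text: the EPIC product is 126 bp longer than the intron itself.
FLANKING_EXON_BP = 126


def variant_name(locality: str, epic_bp: int, haplotype: str = "") -> str:
    """Build a variant name: two-letter locality prefix, EPIC length, optional haplotype letter."""
    # Strip accents so that "Réunion" gives "Re" (assumed from the names used in the text).
    ascii_name = unicodedata.normalize("NFKD", locality).encode("ascii", "ignore").decode()
    return f"{ascii_name[:2].capitalize()}-{epic_bp}{haplotype}"


def intron_length(epic_bp: int) -> int:
    """Approximate intron length implied by an EPIC fragment length."""
    return epic_bp - FLANKING_EXON_BP
```

For example, `variant_name("Kenya", 1400, "B")` gives `Ke-1400B`, and `intron_length(1400)` gives 1274 bp, bearing in mind that the fragment lengths themselves are gel estimates.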

A total of 18 distinct size variants, ranging from approximately 1400 to 3450 bp, was amplified from 16 natural populations from five geographical regions in the species range: Africa, Mediterranean Basin, Latin America, Pacific, and Australia. These variants were grouped into four size categories: short (1400 bp), medium (1920–2150 bp), long (2480–2690 bp), and very long (3450 bp) (Table 1).
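As a minimal sketch, the four size categories can be written as a lookup on the EPIC fragment length. The boundaries are those quoted above; treating them as closed intervals and returning `None` for lengths outside them are assumptions of this example.

```python
# Size categories as given in the text (EPIC fragment length, bp).
SIZE_CATEGORIES = {
    "short": (1400, 1400),
    "medium": (1920, 2150),
    "long": (2480, 2690),
    "very long": (3450, 3450),
}


def size_category(epic_bp: int):
    """Return the size category for an EPIC fragment length, or None if it fits none."""
    for name, (low, high) in SIZE_CATEGORIES.items():
        if low <= epic_bp <= high:
            return name
    return None
```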

Table 1 Distribution of the 18 intron size variants in the 16 geographic populations and four laboratory strains

Only Part of the Intron Size Variation Is Due to the postdoc Element

As previously shown, intron 1 of the medfly Adh-1 gene contains a mariner-related sequence, postdoc, a defective Ccmar1 element (Brogna et al. 2003; Gomulski et al. 1997). The presence of postdoc suggested that the detected size polymorphism might be due to insertion-deletion variants of the element, similar to those associated with transposable elements in Drosophila and other species including tephritids (Brunet et al. 1996; Gomulski et al. 2001). To test this possibility, a nested PCR approach was used to amplify the postdoc sequences from individual size variants, using a primer complementary to part of the terminal inverted repeat sequence of the element. The results of this analysis can be summarized as follows. The shortest intron variants, Ke-1400A, Ke-1400B, and Ca-1400 (Canary), and two variants from the medium size category, Ke-1950 and Ke-1970, did not show any amplification, suggesting that either these intron variants do not contain the postdoc element or variation at the primer sites impeded amplification (subsequent RFLP and sequencing analyses [see below] confirmed the first hypothesis). All the other variants, from the medium, long, and very long categories, contain the element. The size of postdoc is quite polymorphic, there being at least five different size variants of the element detectable by gel electrophoresis. However, the differences in the size of the postdoc variants are generally small (∼20 bp), and the postdoc size variants are distributed among the intron size categories. Therefore, the size polymorphism of postdoc cannot account for the size polymorphism of the entire intron. Extreme examples are the Re-2480 (Réunion) intron variant, which contains a postdoc considerably shorter than that found in the medium category variants, as well as the two medium-size intron variants which lack the postdoc sequence.

Restriction Analysis Reveals the Distribution of Indels Between the Variants and the Presence of Different Haplotypes

The structure of the intron variants was investigated by means of restriction analysis. More than 100 individual PCR-amplified fragments from 14 geographic populations were analyzed. These fragments represent 15 size variants, from which 23 restriction variants (haplotypes) have been distinguished. The restriction maps of these variants are shown in Fig. 1; they indicate that there are at least three main sources of size variation in the intron. The first is the postdoc element. The presence of this element is denoted by a characteristic SspI–AsnI–HindII–HaeIII–EcoRI restriction fragment pattern, which, with some variations, is generally conserved in most of the variants. Determination of the point of insertion of the element is facilitated by the presence of an SspI restriction site which spans the junction of the intron sequence and the inverted repeat sequence at the 3′ end of the element (the element is inverted with respect to the sense of the Adh-1 gene). The postdoc identified in the Re-2480 intron, previously shown in the nested-PCR analysis to contain a shorter copy of the element, lacks the HindII and EcoRI sites. The characteristic postdoc restriction pattern is missing in the maps of the two haplotypes representing the shortest size variant (Ke-1400A and B), as well as in two of the medium-size variants, Ke-1950 and Ke-1970, supporting the result of the PCR analysis which indicated that these intron variants do not contain the element.

Figure 1

Restriction maps of 23 intron haplotype variants corresponding to 15 size variants. Exon sequences are shown in black, intron sequences in white, and the postdoc element is shaded. The sites of the following restriction endonucleases are shown: AsnI (A), EcoRI (E), HaeIII (H), HindII (I), MspI (M), SspI (P), and SalI (S).

Second, inspection of the maps of six representatives of the long and very long size categories clearly indicates that they all contain an additional sequence in the region immediately upstream of the HaeIII–MspI doublet. This additional sequence is marked with a central SspI site, and for simplicity, it will be called the 3 indel. An insertion that, due to its size and position, could be homologous to the 3 indel, but lacks the central SspI site, is present in the two medium-size variants that lack the postdoc element (Ke-1950 and Ke-1970).

Third, the intron of the very long category, Mo-3450, clearly contains an additional sequence of about 700 bp immediately downstream of the postdoc element. It is recognized by the same restriction pattern, SspI–SspI–AsnI, which is also present downstream in the intron, suggesting that it may have arisen as a duplication, hence it will be called the central duplication.

Finally, most of the size variation within each of the medium and long categories appears to be due to short insertions/deletions present in the postdoc element, as well as in the rest of the intron.

Within each intron size variant, restriction mapping reveals additional polymorphism; the variants 1400, 1920, 2000, 2060, and 2630 exhibited different restriction enzyme haplotypes in different populations. For example, two haplotypes of the short variant, 1400, are present in Kenya (A and B) which differ in the presence/absence of an EcoRI site; only the B haplotype was found in Canary. Within the medium-size category, two haplotypes of the 1920 size variant which differ at six restriction sites were detected from Kenya and Morocco. Three haplotypes of the 2000 variant were found; the Ke-2000A haplotype differs from the Ke-2000B and Re-2000 haplotypes by 10 and 7 sites, respectively. Four haplotypes of the 2060 variant were identified. The Ke-2060A haplotype differs from that of the Ke-2060B at five sites. A third 2060 haplotype, originally identified in the Benakeion strain, Be-2060, is widespread in nine populations (Kenya, Libya, Morocco, Israel, Crete, Chios, Thessaloniki, Italy, and Guatemala). The fourth haplotype, identified only in Réunion, differs from the Be-2060 and Ke-2060A haplotypes by the lack of the entire SspI–SalI–HindII–SspI pattern just downstream of postdoc.

Among the long size variants, two haplotypes of the 2630 variant were identified in Kenya and Guatemala, respectively. In contrast, the long variant Be-2590, originally identified in the Benakeion strain, shows no evidence for restriction site polymorphisms with the seven restriction enzymes used, in 30 examples from 11 geographic populations (Kenya, Réunion, Libya, Israel, Crete, Chios, Thessaloniki, Athens, Italy, Argentina, and Australia). This haplotype was also invariant in the four laboratory strains surveyed.

Molecular Basis of the Intron Size Polymorphism in C. capitata

Five intron size variants were selected for study, together representing the four intron size categories: the Ke-1400B (short) intron variant; two variants from the medium category, Ke-1970 and Be-2060, which, despite their similar sizes, displayed very different restriction maps; the Be-2590 variant (the most common representative of the long size category); and the Mo-3450 (very long) variant.

The multiple alignment of the five introns reveals a significant degree of variation. First, length polymorphism is primarily due to the occurrence of three distinct insertions/deletions (indels) (Fig. 1). The Ke-1400B variant is the shortest and does not carry any of the three indel sequences. The Ke-1970 variant contains an indel of 566 bp, the 3 indel, which shows no significant homology with sequences in the databases using nucleotide or translated BLAST searches. Furthermore, it is not flanked by inverted or direct repeats that might suggest that it is a transposable element. The Be-2060 variant is characterized by the presence of a 626-bp sequence upstream of the 3 indel, flanked by perfectly conserved 30-bp inverted terminal repeats and showing sequence homology to the mariner-like elements. This postdoc element is inserted in the opposite orientation with respect to the Adh-1 gene. Typically for a mariner/Tc1-like element, postdoc is flanked by TA dinucleotides. However, the central region of this element, which is occupied by the transposase gene, carries several deletions, frame shifts, and stop codons. The Be-2590 variant contains both the postdoc element and the 3 indel. The Mo-3450 variant contains, in addition to the postdoc element and the 3 indel, a third indel, 704 bp in length. This sequence, the central duplication, is a duplication that includes 258 bp of the postdoc element, the region between postdoc and the 3 indel, and 20 bp of the 3 indel.

Second, extensive sequence polymorphism exists among the five variants. The intron variants share 90.8–99.7% DNA identity. Sequence comparisons of the regions common to all variants, the postdoc element and the 3 indel (Table 2), indicated that the most similar variants are the Be-2590 and Mo-3450 variants, with almost 99.7% DNA identity. The Ke-1400B variant appeared to be most similar to the Ke-1970 variant, while the Be-2060 variant is closer to the longer variants than to the Ke-1400B or Ke-1970 variants. The duplicated sequences, corresponding to the central duplication, within Mo-3450 showed about 99% identity.
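Percentage identity between two aligned sequences can be computed as in the sketch below. The text does not state how gapped positions were treated when identities were calculated; skipping any column with a gap in either sequence is an assumption of this example, and the function name is hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percentage of identical bases over alignment columns with no gap in either sequence."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # assumption: gapped columns are excluded from the comparison
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared
```

With the invented pair `ACGT-ACGTA` and `ACGTTACGAA`, nine columns are compared and eight match, giving about 88.9% identity.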

Table 2 Percentage nucleotide identities between the five representative intron variants at sequences common to all intron variants, at the postdoc element, and at the 3 indel

Intron Size Polymorphism in C. rosa

The homologous Adh-1 intron from the congeneric species C. rosa was studied in order to obtain an outgroup for comparative and phylogenetic analysis. The choice of C. rosa as an outgroup was based on its phylogenetic proximity to C. capitata, despite its classification in another subgenus, Pterandrus (De Meyer 2000). Unfortunately, specimens of other species belonging to the same subgenus as C. capitata, Ceratitis, were unobtainable.

Amplification of the intron resulted in four variants ranging in size from about 1100 to 2000 bp. These variants were named according to the length of the EPIC fragment as deduced from sequencing (Ke-1136, Ke-1247, Ke-1259, and Ke-2015).

The four C. rosa intron variants shared 91.3–94.2% DNA identity and shared 77.9–80.2% identity with the C. capitata intron variants. All the C. rosa intron sequences lack the postdoc element and central duplication sequences. However, all but one of the C. rosa sequences contain part of the 3 indel sequence, the major part of this indel apparently having been lost in a deletion which extends into the intron sequence downstream of the 3 indel. The C. rosa Ke-2015 intron retains the longest remnant of the 3 indel (118 bp), which shares 73.7–74.6% identity with the corresponding sequence in the C. capitata introns. The C. rosa Ke-2015 intron also contains a 923-bp sequence, Crmar2, flanked by imperfect 32/33-bp inverted terminal repeats with sequence homology to the Tc1/mariner-like superfamily of elements (Gomulski et al. 2001). This element is inserted at a position about 400 bp downstream of the corresponding postdoc insertion site in the C. capitata introns. The other C. rosa and C. capitata intron sequences all lack the Crmar2 element.

Phylogenetic Analysis

Sequence comparisons were used to determine the evolutionary relationships between the intron size variants. The sequences of the five representative C. capitata variants and four C. rosa variants were aligned. The alignment formed the basis of a phylogenetic analysis of the introns with the C. rosa introns as an outgroup and excluding all gapped sites and the four indel/transposon sequences (postdoc, central duplication, 3 indel, and Crmar2). Using Modeltest 3.06 (Posada and Crandall 1998) under the Hierarchical Likelihood Ratio tests (hLRTs), we concluded that the HKY+Γ model (see legend to Fig. 2 for details) was the most appropriate model of evolution for these data. Using this model a maximum likelihood tree was generated using PAUP* 4.0 b10. A bootstrap of 1000 replications was also employed. In the resulting tree (Fig. 2) the C. capitata variants form two main clusters, the first of which contains the two shortest variants (Ke-1400B and Ke-1970). The other cluster contains the Be-2060 variant and a further bifurcation supported by a 100% bootstrap value containing the longest variants (Be-2590 and Mo-3450). The short branch lengths suggest that the Be-2590 and Mo-3450 sequences are the most closely related of the intron variants.
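For orientation, the general textbook form of the HKY85 model with gamma-distributed rate variation can be written as follows. This is not a statement of the parameter values estimated in this study; the symbols (equilibrium base frequencies π, transition/transversion rate parameter κ, and gamma shape parameter α) are generic, with rows and columns ordered A, C, G, T.

```latex
Q \;=\;
\begin{pmatrix}
\cdot          & \pi_C          & \kappa\,\pi_G & \pi_T          \\
\pi_A          & \cdot          & \pi_G         & \kappa\,\pi_T  \\
\kappa\,\pi_A  & \pi_C          & \cdot         & \pi_T          \\
\pi_A          & \kappa\,\pi_C  & \pi_G         & \cdot
\end{pmatrix},
\qquad
q_{ii} \;=\; -\sum_{j \neq i} q_{ij},
\qquad
r \;\sim\; \mathrm{Gamma}(\text{shape}=\alpha,\ \text{rate}=\alpha),
\quad \mathrm{E}[r] = 1 .
```

Transitions (A↔G and C↔T) carry the factor κ, transversions do not, and each site evolves under Q scaled by its own rate r drawn from the gamma distribution.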

Figure 2

Phylogenetic tree of the five C. capitata and four C. rosa intron sequences constructed using the maximum likelihood method in PAUP* 4.0 b10 (Swofford 1998). The postdoc, central duplication, 3 indel, and Crmar2 sequences and all gapped sites were excluded from the analysis. A variant of the HKY85 model, HKY+Γ, was employed and transition/transversion ratio was estimated as 0.6405. The tree is the 50% majority rule consensus of 1000 bootstrap replicates with the full heuristic search option. The numbers on the branches are the bootstrap values. The C. rosa intron sequences (named according to their sequence length) were treated as an outgroup. The tree has an estimated ln likelihood score of −2406.86.

Discussion

The first intron of the medfly Adh-1 gene exhibits a high degree of size variation. In this study, five representative medfly intron size variants and four intron variants from C. rosa were cloned and sequenced in order to determine the molecular nature of this size variation.

Analysis of the medfly intron restriction maps and sequences indicated that the four size categories could be explained by the presence or absence of three indels: postdoc, the 3 indel, and the central duplication. The short category intron lacks all three indels, that of the medium category contains either postdoc or the 3 indel, the long category intron contains both postdoc and the 3 indel, and the very long category intron contains all three indels. The other size variants within the medium and long categories which differ by 20–40 bp are most likely explained by the numerous small indels (1–30 bp) dispersed throughout the intron.
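This additive explanation can be checked with simple arithmetic. The sketch below starts from the nominal 1400-bp EPIC fragment of the short variant and adds the indel lengths reported in the Results (postdoc 626 bp, 3 indel 566 bp, central duplication 704 bp). Treating these lengths as fixed across variants is a simplification, since the small indels mentioned above are ignored, and the variable names are hypothetical.

```python
# Nominal EPIC fragment length of the short variant (gel estimate from the text).
SHORT_EPIC_BP = 1400

# Indel lengths as reported for the sequenced variants.
INDEL_BP = {"postdoc": 626, "3 indel": 566, "central duplication": 704}

# Indel content of the five sequenced variants, as described in the text.
COMPOSITIONS = {
    "Ke-1400B": (),
    "Ke-1970": ("3 indel",),
    "Be-2060": ("postdoc",),
    "Be-2590": ("postdoc", "3 indel"),
    "Mo-3450": ("postdoc", "3 indel", "central duplication"),
}


def expected_epic_length(indels) -> int:
    """Expected EPIC fragment length if size differences were due to the three indels alone."""
    return SHORT_EPIC_BP + sum(INDEL_BP[name] for name in indels)


expected = {name: expected_epic_length(indels) for name, indels in COMPOSITIONS.items()}
```

The predicted lengths (1966, 2026, 2592, and 3296 bp) fall close to the nominal sizes of Ke-1970, Be-2060, and Be-2590. The value for Mo-3450 is lower than its gel estimate but close to what the longest sequenced intron (3166 bp) plus the 126 bp of flanking exon sequence would give.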

The Ke-1400B and Ke-1970 intron variants, like those of C. rosa, lack the postdoc sequence. The absence of excision relics, in particular, the TA target duplication typical of mariner integration (van Luenen 1994; Hartl et al. 1997), indicates that these variants never contained the element. This suggests that the Ke-1400B or Ke-1970 variants are the most representative of the ancestral intron sequence. Phylogenetic analysis of the five intron sequences, using the C. rosa intron variants as an outgroup, indicated that the Ke-1400B and Ke-1970 variants cluster together. The Be-2590 and Mo-3450 variants also cluster together and the branch lengths suggest that they are the most closely related of the variants. However, the fact that the Mo-3450 intron is the only variant containing both the 3 indel and the central duplication, which is a partial duplication of both the postdoc and the 3 indel, suggests that Mo-3450 is the most recent of the five variants.

The origin of the 3 indel is intriguing, as it has none of the characteristics of a transposable element and shares no significant identity with any sequence in the databases. However, the C. rosa intron sequences contain at least part of the 3 indel, suggesting that it was present in the ancestral intronic sequence before the divergence of the two species. This suggests that the Ke-1970 variant (medium category) might represent the ancestral form of the intron, the 3 indel having been deleted in a subset of the introns which gave rise to the shorter 1400-like variants (short category). The invasion of the genome by the postdoc element would thus have given rise to Be-2060-like introns (medium category) and Be-2590-like introns (long category). The Mo-3450 variant (very long category) may have been derived from the Be-2590 variant by unequal crossing-over or a DNA slippage event.

The base composition of the intron sequences appears to corroborate this sequence of events. The medfly genome is very A+T rich (68.1% [Ooka and Neulat-Portier 1969]). An analysis of the genomic sequences of Adh-1 and 15 other medfly nuclear genes (Table 3), covering almost 46 kb, indicated that the composition of the coding sequences (52.5% A+T) is very different from that of the intron (68.0% A+T), 5′ (66.7% A+T) and 3′ flanking (66.4% A+T) sequences. The Adh-1 first intron as a whole is extremely A+T rich (70.4–78.3%) compared to the coding sequence of the gene (53% A+T). This high A+T content has been shown to be a general property of introns with respect to coding sequences, yeast and mammals being notable exceptions (Csank et al. 1990). There are evident differences in the composition of the different parts of the first intron of Adh-1. The common intronic sequence (excluding the three indel sequences) has the highest A+T content (77.2–78.3%), the 3 indel, 64–65.9% A+T, while the postdoc sequence composition approaches that of the coding sequence (58.4–59.9% A+T). The base composition of the postdoc element may be a legacy of its recent coding function, whereas the high A+T content of the 3 indel suggests a noncoding or long lost coding function for this sequence.
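The A+T percentages compared above are straightforward to compute for any region of interest. The following is a minimal sketch; ignoring ambiguous bases such as N is an assumption of this example, not something stated in the text.

```python
def at_content(seq: str) -> float:
    """Percentage of A and T among the unambiguous bases (A, C, G, T) of a sequence."""
    bases = [b for b in seq.upper() if b in "ACGT"]  # assumption: ambiguity codes are skipped
    if not bases:
        raise ValueError("sequence contains no unambiguous bases")
    at = sum(1 for b in bases if b in "AT")
    return 100.0 * at / len(bases)
```

Applied separately to the common intronic sequence, the 3 indel, and the postdoc element of an intron, such a function would reproduce the kind of region-by-region comparison described in the paragraph above.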

Table 3 Sixteen C. capitata complete or partial genomic sequences of nuclear genes considered in the analysis of sequence composition and intron length

The postdoc element inserted in the intron is clearly nonfunctional. Gomulski et al. (1997) isolated a full-length copy of this element, Ccmar1, and showed that it is a peripheral member of the mellifera subfamily of mariner elements. Torti et al. (2000) estimated that 900 copies of Ccmar1 are present in the diploid genome, almost exclusively in euchromatic regions. This suggests that it is a relatively recent acquisition according to the transposition/selection model, which predicts that newly acquired elements are expected to be preferentially located in the euchromatin and more ancient and inactivated elements in the heterochromatin (Langley et al. 1988; Charlesworth et al. 1994, 1997). Indeed, the element was estimated to have entered the genomes of C. capitata and C. rosa or their common ancestor about 2.3 Mya (Torti et al. 1998). The majority of the C. capitata intronic variants contain a copy of this element, whereas it was absent in the variants of the C. rosa intron examined. The absence of postdoc in the C. rosa intron is interesting, given that 5000 copies of the element are present in the C. rosa genome (Torti et al. 1998).

In contrast, the 2015-bp variant of the C. rosa intron contains an internally deleted copy of the Crmar2 element (Gomulski et al. 2001). The Crmar2 element is not only absent from all the C. capitata intron variants analyzed but also apparently absent from the entire medfly genome, having invaded the C. rosa genome some 0.75 Mya, after the divergence of the two tephritid species (Gomulski et al. 2001).

A long first intron in Adh-1 appears to be a common feature of tephritid species. This intron ranges from 1269 to 3166 bp in C. capitata, 1136–2015 bp in C. rosa, and 1070 bp in Bactrocera oleae (Goulielmos et al. 2001). This compares to an average intron length of 496 bp (range, 59–9715 bp) of 31 introns from 16 C. capitata nuclear genes available in the GenBank database (Table 3). However, the distribution of intron size is highly skewed toward short introns, the median intron length being 79 bp. Mount et al. (1992) and Comeron and Kreitman (2000) found that there is also a highly asymmetrical intron length distribution in Drosophila melanogaster. An analysis of 1345 introns showed that there is a bimodal distribution in intron size, the majority being less than 90 bp in length whereas over 25% are longer than 400 bp. Short D. melanogaster introns typically lack polypyrimidine stretches, whereas the long introns are similar to those of mammals, which typically have more than 80% pyrimidine content near the 3′ splice site. The medfly Adh-1 first intron variants fit neatly into the long intron class because, apart from their length, the pyrimidine content near the 3′ splice site is very high (>90%) in all the variants sequenced.
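Pyrimidine content near the 3′ splice site can be measured as in the sketch below. The text does not specify the window over which it was calculated; the 20-base window, and the exclusion of the final two intron bases (the terminal dinucleotide), are assumptions chosen for illustration, as is the invented test sequence.

```python
def pyrimidine_content(intron: str, window: int = 20, skip_3prime: int = 2) -> float:
    """Percentage of C and T in the `window` bases immediately upstream of the
    last `skip_3prime` bases of the intron.

    Both parameters are hypothetical defaults, not values taken from the study.
    """
    seq = intron.upper()
    end = len(seq) - skip_3prime
    region = seq[max(0, end - window):end]
    if not region:
        raise ValueError("intron is too short for the requested window")
    pyrimidines = sum(1 for base in region if base in "CT")
    return 100.0 * pyrimidines / len(region)


# Invented intron: 5' end, A-rich filler, a pyrimidine-rich tract, and a terminal AG.
toy_intron = "GTAAGT" + "A" * 30 + "TTTTCTTTCCTTTTCTTTAT" + "AG"
toy_value = pyrimidine_content(toy_intron)
```

In the toy sequence, 19 of the 20 bases in the window are pyrimidines, so the function returns 95%.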

It is accepted that there are structural constraints limiting minimum intron length below which correct splicing is inefficient (Upholt and Sandell 1986; Tsurushita and Korn 1987; Mount et al. 1992), whereas longer introns would presumably incur extra time and energy costs during replication and transcription, particularly in highly expressed genes (Castillo-Davis et al. 2002). Interestingly, the medfly Adh-2 gene is expressed at a much higher level than Adh-1 (Brogna et al. 2001) and has a much shorter first intron. However, in Drosophila there is an inverse correlation between intron length and recombination frequency, that is, longer introns tend to be present in regions of low recombination. One explanation for this observation is that longer introns enhance recombination between mutations under the influence of selection in adjacent exons (Comeron and Kreitman 2000). The selective advantage of longer introns is expected to be greatest when the recombination rate is low and to diminish with increasing recombination unless the number of beneficial mutations is large (Barton 1995; Otto and Barton 1997). However, Carvalho and Clark (1999) noted that the correlation between intron size and recombination rate was true only for introns shorter than 100 bp. They suggest that natural selection acts against long and very short introns and hence these tend to be found in regions of low recombination where the efficiency of selection is reduced (Hill and Robertson 1966). Schaeffer (2002), however, in a study of the Adh region in Drosophila pseudoobscura, found that selection tended to favor increased intron size despite high recombination. In wild native Kenyan populations of C. capitata the Adh-1 first intron varies extensively in size, suggesting that the sequence is selectively neutral or nearly so. In most of the derived populations only a few of the size variants are present (Table 1) (Gomulski et al. 1998), as expected with a finite founder population size, although the possibility of some form of selection cannot be excluded.

Introns often contain regulatory elements, alternative promoters or antisense promoters (Laurie and Stam 1994). These functional elements tend to be present at relatively high frequencies in the first intron of protein-coding genes (Duret 2001), which also tends to be the longest intron. Despite the limited medfly sequence data (only six complete or partial genomic sequences of nuclear genes containing multiple introns including the first intron are available in the databases; Table 3), it was possible to confirm that first introns also tend to be longer in this species (Mann–Whitney, p = 0.039), the mean length of the first and subsequent introns being 2077 ± 1553 (SE) and 81.7 ± 10.9 bp, respectively.

The genetic aspects of the medfly colonization process have been studied through variation in proteins (Gasperi et al. 1991; Malacrida et al. 1998), mitochondrial DNA (McPheron et al. 1995; Gasparich et al. 1997), RAPDs (Baruffi et al. 1995; Haymer et al. 1997), and microsatellites (Bonizzoni et al. 2000, 2001). Several studies have illustrated the usefulness of intron variation in the analysis of medfly population structure and colonization events (Gomulski et al. 1998; Villablanca et al. 1998; Davies et al. 1999). Villablanca et al. (1998) and Davies et al. (1999) used EPIC-PCR to examine variation at four intron loci within the muscle-specific actin, chorion s36, vitellogenin 1 gamma, and Cu/Zn superoxide dismutase genes. Unlike at the Adh-1 locus, these intronic sequences were generally homogeneous in terms of length, the majority of the polymorphisms being due to single-base substitutions. However, the resulting multilocus genotype of individual flies was shown to be a powerful tool for the study of population structure with sufficient resolution to distinguish the source of recent bioinvasions (Davies et al. 1999). The addition of ultravariable introns such as the Adh-1 first intron will increase the power of this technique for detecting medfly bioinvasion sources.