Introduction

The domesticated apple, Malus × domestica Borkh., is a member of the Rosaceae family. The family consists of over 100 genera and 3,000 species, most of which are perennial trees, shrubs, and herbs (Tatum et al. 2005). The apple is a self-incompatible and highly heterozygous diploid with a base chromosome number of 17. Although the apple is a diploid (2n = 34), it has an allopolyploid origin (Chevreau et al. 1985). The apple is not only a major economic fruit crop grown worldwide, but also serves as an important model species for functional genomics research of woody perennial angiosperms due to its relatively small genome size of 750 Mb/haploid (Tatum et al. 2005).

Bacterial artificial chromosome (BAC) libraries have been extensively used in genomics research due to their large DNA inserts, high cloning efficiency, and stable maintenance of foreign DNA. In plants, BAC libraries have been constructed for a variety of species such as Arabidopsis (Choi et al. 1995), rice (Wang et al. 1995), maize (Yim et al. 2002), sorghum (Woo et al. 1994), soybean (Shoemaker et al. 1996; Salimath and Bhattacharyya 1999; Tomkins et al. 1999; Meksem et al. 2000), papaya (Ming et al. 2001), and apple (Vinatzer et al. 1998; Xu et al. 2001). These libraries have made invaluable contributions to plant genomic studies including map-based or positional cloning of genes, genome-wide physical map construction (Mozo et al. 1999; Klein et al. 2000; Chen et al. 2002; Xu and Korban 2002; Shultz et al. 2006; Han et al. 2007), genome sequencing (The Arabidopsis Genome Initiative 2000; International Rice Genome Sequencing Project 2005), and comparative genomics (O’Neill and Bancroft 2000; Ilic et al. 2003).

BAC end sequencing was proposed early on as a viable and efficient strategy for genome sequencing projects (Venter et al. 1996). Since then, it has become an important component of genomics research efforts, as BAC end sequences (BESs) are very useful in genome assembly and chromosome walking. For example, BESs can serve as sequence tag connectors (STCs) for selecting minimally overlapping clones targeted for genome sequencing (Mahairas et al. 1999). BES pairs combined with BAC-fingerprinted contigs can serve as a primary scaffold for whole-genome shotgun sequence assembly. BESs are also useful for generating comparative physical maps (Larkin et al. 2003; Shultz et al. 2007b). Moreover, BESs are valuable resources for the development of genetic markers such as BAC-end sequence-based microsatellite markers (Shultz et al. 2007a). In addition, analysis of BES data can provide an overview of the sequence composition of an unsequenced genome, including gene density and the presence of potential transposable elements (TEs) and microsatellites (Lai et al. 2006).

Recently, we developed a genome-wide BAC physical map of the apple (M. × domestica) (Han et al. 2007). In order to develop genetic markers to integrate the physical and genetic maps, a total of ∼2,100 BAC clones, selected from 1,767 different contigs, were sequenced at both ends, resulting in 3,744 BESs. Because these BAC clones were selected from different contigs, they are likely to be randomly distributed across the apple genome. Hence, BESs derived from these BAC clones provided a unique opportunity to gain insights into the organization of the apple genome. Here, we report on the analysis of 3,744 BESs, and focus our attention primarily on microsatellite content, repeat element composition, GC content, protein-coding regions, and comparative mapping of BAC-end sequence pairs to other sequenced plant genomes. These BESs will serve as useful resources for genetic marker development, integration of physical and genetic maps, and whole genome sequencing of the apple.

Materials and methods

Source of BAC clones and BAC end sequencing

Two complementary BAC libraries (BamHI and HindIII) from apple cv. GoldRush were used. The BAC vectors for the BamHI and HindIII libraries were pBeloBAC11 and pIndigoBAC-5, respectively. BAC clones, picked from 384-well microplates, were inoculated in 96-deep well plates containing 1.5 ml of 2× LB medium plus 12.5 μg/ml chloramphenicol. Plates were incubated at 37°C with continuous shaking at 325 rpm for 20–24 h. BAC DNA was then isolated using a modified alkaline lysis method. BAC end sequencing was performed using ABI Big Dye Terminator v3.1 chemistry (ABI, CA, USA), and reactions were analyzed on an ABI 3730xl instrument. Base-calling and sequence trimming were performed with PHRED software (Ewing and Green 1998) using the default parameters. The sequence data were converted into FASTA format, and vector sequences were masked. Terminal vector sequences were then trimmed, and BESs shorter than 100 bp were discarded.
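The final filtering step described above can be sketched in a few lines of Python. This is only an illustration, not the pipeline used in this study: the function names are hypothetical, and the use of `X` as the masking character for vector bases is an assumption (the masking tool is not specified here).

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trim_masked_ends(seq, mask_char="X"):
    """Strip masked vector bases from both ends of a read.

    Assumption: vector bases were masked as 'X' (or 'x').
    """
    return seq.strip(mask_char.upper() + mask_char.lower())


def filter_bes(records, min_len=100):
    """Trim masked termini, then discard reads shorter than min_len bp."""
    kept = {}
    for name, seq in records:
        trimmed = trim_masked_ends(seq)
        if len(trimmed) >= min_len:
            kept[name] = trimmed
    return kept
```

Reads of exactly 100 bp are retained under this reading of the threshold, which agrees with the 100 bp minimum read length reported in the Results.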

Identification of simple sequence repeats

All trimmed BESs of at least 100 bp in size were scanned for five classes of simple sequence repeats (SSRs): mono-, di-, tri-, tetra-, and penta-nucleotide tandem repeats. SSRs recorded for the final dataset included monomers with at least 20 repeats and dimers to pentamers of at least 15 bp in length.
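The SSR criteria above can be expressed as a small scanner. The sketch below is an illustration under stated assumptions rather than the software used in this study: the 15 bp minimum for dimers to pentamers is interpreted as whole repeat units only (so 8, 5, 4, and 3 units, respectively), and units that are themselves repeats of a shorter motif (for example `AA` or `ATAT`) are skipped so that each SSR is reported once under its shortest motif. The function names are hypothetical.

```python
import math


def is_primitive(unit):
    """Return True if unit is not a repetition of a shorter motif."""
    k = len(unit)
    return not any(
        k % p == 0 and unit == unit[:p] * (k // p) for p in range(1, k)
    )


def find_ssrs(seq, min_mono_repeats=20, min_len=15, max_unit=5):
    """Return a list of (start, unit, n_repeats) for perfect tandem repeats.

    Monomers need at least min_mono_repeats copies; longer motifs need
    enough whole copies to span at least min_len bp (an assumption).
    """
    seq = seq.upper()
    hits = []
    for k in range(1, max_unit + 1):
        min_repeats = min_mono_repeats if k == 1 else math.ceil(min_len / k)
        i = 0
        while i + k <= len(seq):
            unit = seq[i:i + k]
            # Skip masked/ambiguous bases and non-primitive motifs.
            if set(unit) - set("ACGT") or not is_primitive(unit):
                i += 1
                continue
            n = 1
            while seq[i + n * k:i + (n + 1) * k] == unit:
                n += 1
            if n >= min_repeats:
                hits.append((i, unit, n))
                i += n * k  # continue after the repeat tract
            else:
                i += 1
    return hits
```

Because the scan reports the motif in the phase in which it is first encountered, a tract may be recorded as `TA` rather than `AT`; grouping motifs into classes such as AT/TA, as in Table 2, would be a separate step.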

Analysis of repetitive sequences

BESs were compared with The Institute for Genomic Research (TIGR) plant repeat databases (ftp://ftp.tigr.org/pub/data/TIGR_Plant_Repeats/) using BLAST at an E-value cut-off of 1e−5. Repetitive sequences were annotated according to the best match in the repeat database, and classified based on TIGR codes for plant repetitive sequences (http://www.tigr.org/tdb/e2k1/plant.repeats/repeat.code.shtml).

Annotation

To identify protein-coding regions, BESs with no homology to the repeat sequence database were compared with the protein database of Arabidopsis thaliana (ftp://ftp.arabidopsis.org/home/tair/Proteins/) using BLASTX at an E-value cut-off of 1e−6. BESs with significant matches to the Arabidopsis protein database were annotated based on the original A. thaliana protein database annotation.

Comparative genome mapping

All pairs of BESs were compared with whole genome sequences of Arabidopsis, rice (Oryza sativa), and poplar (Populus trichocarpa) using TBLASTX at an E-value cut-off of 1e−6. Whole genome sequence databases of A. thaliana, rice, and poplar were downloaded from The National Center for Biotechnology Information (NCBI; http://www.ncbi.nlm.nih.gov/genomes/static/euk.html). If a pair of BESs had significant hits separated by at least 10 kb and not more than 300 kb in the target genome, the corresponding BAC was considered to be potentially colinear with the target genome (Lai et al. 2006).
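The colinearity criterion above amounts to a simple test on the best hits of the two ends of a BAC. The following Python sketch illustrates that test only; the `Hit` record and function name are hypothetical, each end is assumed to have already been reduced to its top significant hit, and the span is measured between hit start coordinates, which is an assumption.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    chrom: str  # target chromosome of the top significant hit
    pos: int    # hit start coordinate on the target genome, in bp


def is_potentially_colinear(
    left: Optional[Hit],
    right: Optional[Hit],
    min_span: int = 10_000,
    max_span: int = 300_000,
) -> bool:
    """Apply the 10-300 kb span rule to the two ends of one BAC."""
    if left is None or right is None:
        return False  # at least one end has no significant hit
    if left.chrom != right.chrom:
        return False  # ends map to different chromosomes
    span = abs(left.pos - right.pos)
    return min_span <= span <= max_span
```

For example, ends hitting the same chromosome 69 kb apart pass the test, whereas ends 5 kb or 400 kb apart, or on different chromosomes, do not.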

Results

BAC end sequencing

A total of 2,112 BAC clones from cv. GoldRush were sequenced at both ends. Of these BAC clones, 62.2% and 37.8% were from BamHI and HindIII libraries, respectively. Following trimming and vector sequence removal, 3,744 high-quality BESs were generated (Table 1). Of these BESs, 1,717 were paired end reads. The size of these BESs ranged from 100 to 910 bp with an average of 636 bp, thus corresponding to a total length of ∼2.4 Mb. The G + C content of these BESs ranged from 11% to 66% with an average of 39%.

Table 1 Statistical information and composition of apple BAC end sequences (BESs)

Simple sequence repeats

A total of 320 SSRs or microsatellites were discovered within the BESs, and these contained a variety of repeat types (Table 2). Di-nucleotide repeats were the most abundant, accounting for 48.1% of all SSRs, followed by penta- and mono-nucleotide repeats, which accounted for 19.4% and 17.2%, respectively. Tetra- and tri-nucleotide repeats were relatively rare, accounting for 6.9% and 8.4% of all SSRs, respectively. Of the di-nucleotide repeats, AT/TA was the most abundant, accounting for 56.5% of all di-nucleotide repeats, while AG/CT, TC/GA, GT/AC, and TG/CA repeats accounted for 18.8%, 11.0%, 9.1%, and 4.5%, respectively. No GC/CG repeats were found in this study. Among the monomer repeats, A/T was predominant, while G/C occurred very rarely (Table 2). Overall, AT/TA dimer repeats were the most abundant SSRs in apple BESs. Moreover, the length distribution of all SSRs indicated that the frequency of repeats decreased with repeat length (Fig. 1). In addition, 19 pairs of SSRs were clustered within the same BESs. Of the paired SSRs, 4 tetramers were clustered with both dimers and trimers, and 6 pentamers were clustered with both dimers and trimers.

Table 2 Distribution of simple sequence repeats in apple BESs
Fig. 1 Length distribution of different types of SSRs identified within apple BESs

Transposable elements

A total of 3,744 BESs were compared with the plant repeat database, revealing that 786 (20.9%) BESs were homologous to TEs (Table 3). Of these potential TEs, class I transposons or retrotransposons represented the most abundant repeats, accounting for 88.2% of TEs. In contrast, class II transposons and miniature inverted repeat TEs were relatively rare, accounting for 10.9% and 0.9%, respectively. Among the retrotransposons identified in BESs, the total number of long terminal repeat (LTR) retrotransposons, including Ty1-copia and Ty3-gypsy, was 2.8 times higher than that of non-LTR retrotransposons, such as LINE and SINE (Table 3). In addition, more than half of the retrotransposons (54.1%) and most of the transposons (70.9%) could not be clearly assigned to a specific type (Table 3).

Table 3 Summary of potential transposon contents in apple BESs

Protein coding regions

A total of 2,958 BESs with no homology to the plant repeat database were compared with the Arabidopsis protein database. Of the total BESs, 323 (8.6%) were homologous to Arabidopsis proteins at an E-value of <1e−6. Functional annotation of putative gene products was then carried out using the Gene Ontology assignment of the Arabidopsis proteome (http://www.arabidopsis.org/tools/bulk/go/index.jsp). The predicted genes covered a broad range of functional categories, such as cellular components, metabolism, signal transduction, and response to stress (Fig. 2). With 8.6% of BESs having homologous sequences in the Arabidopsis protein database, the total coding region of the apple genome was estimated to be approximately 64.5 Mb (8.6% of 750 Mb), based on an estimated genome size of 750 Mb (Tatum et al. 2005). Assuming an average gene length of 2 kb, similar to that of Arabidopsis (The Arabidopsis Genome Initiative 2000), the total gene content of the apple was estimated to be ∼32,250 (64.5 Mb divided by 2 kb). Moreover, the average GC content of the predicted coding regions of BESs was 43%.
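The gene content estimate above is a two-step extrapolation, which the short Python sketch below reproduces. The function name is hypothetical; the inputs are the figures stated in the text (a coding fraction of 8.6%, a genome size of 750 Mb, and an assumed average gene length of 2 kb).

```python
def estimate_gene_content(coding_fraction, genome_size_bp, mean_gene_len_bp):
    """Extrapolate total coding sequence and gene number from a BES sample.

    coding_fraction: share of BESs with protein homology (e.g. 0.086)
    genome_size_bp: estimated haploid genome size in bp
    mean_gene_len_bp: assumed average gene length in bp
    """
    coding_bp = coding_fraction * genome_size_bp
    n_genes = coding_bp / mean_gene_len_bp
    return coding_bp, n_genes


coding_bp, n_genes = estimate_gene_content(0.086, 750e6, 2_000)
print(f"coding region: {coding_bp / 1e6:.1f} Mb, genes: {n_genes:,.0f}")
```

Using the unrounded fraction 323/3,744 instead of 8.6% gives a slightly higher figure of roughly 32,350 genes, so the estimate is best read as approximate. It is also sensitive to the assumed gene length and treats each homologous BES as entirely coding.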

Fig. 2 Gene ontology annotation of apple BESs. (a) Cellular component; (b) Molecular function; (c) Biological process

Comparative mapping of apple BAC ends to other plant genomes

In order to gain insight into the syntenic relationships between apple and other plant species, apple BESs were BLASTed against whole genome sequences of three sequenced plants: A. thaliana, poplar (P. trichocarpa), and rice (O. sativa). If paired BAC ends mapped to the target genome with a span of 10 kb to 300 kb along with proper orientation, then they were deemed potentially colinear with the target genome. A total of 894 BESs, including 107 BAC end pairs, had significant hits to the Arabidopsis genome. Amino acid identities of these hits ranged from 23% to 96% with an average of 49.1%. Of 107 BES pairs, 28 had the top BLAST hit to the same Arabidopsis chromosome and three were mapped to the Arabidopsis genome with a span of 69–300 kb (Table 4). Similarly, when apple BESs were compared with the Populus genome, 1,110 BESs, including 154 BAC end pairs, had significant matches. Amino acid identities of these matches ranged from 20% to 97% with an average of 53.3%. Among 154 BAC end pairs, 15 had the top match to the same Populus chromosome and eight were mapped to the Populus genome with a span of 12–65 kb (Table 4). Moreover, BESs of the eudicot apple were also BLASTed against the genome of the monocot model plant rice. The results revealed that a total of 907 BESs, including 106 BAC end pairs, had significant hits to the rice genome. The amino acid identities of these hits ranged from 20% to 96% with an average of 48.6%. Of 106 BES pairs, 12 had the top hit to the same chromosome. However, no pairs of apple BESs were mapped to the same rice chromosome with a separation of more than 10 kb and less than 300 kb. This suggested that the colinearity between apple and rice has heavily eroded since the divergence of eudicots from monocots.

Table 4 Comparative mapping of paired apple BAC ends to other plant genomes

Discussion

Analysis of BESs is an efficient approach for developing an understanding of sequence content and complexity of an unsequenced genome (Lai et al. 2006; Cheung and Town 2007). This approach relies on sequencing ends of BAC clones randomly selected from BAC libraries. In this study, we took advantage of the genome-wide BAC-based physical map of the apple, and collected a set of BAC clones. Analysis of BESs from the BAC set has provided an early glance at the apple genome before the whole genome sequence becomes available. The results presented herein indicate that the apple genome contains a large number of potential TEs and microsatellites, and it has a higher degree of colinearity with the Populus genome than with the Arabidopsis genome.

Genomic GC content is one of the most important features of a genome. Genomes with a low GC content are expected to have shorter exons than those with a high GC content (Xia et al. 2003). Based on comparisons of apple BESs with the Arabidopsis protein database, the average GC content of coding regions of the apple genome is ∼43%, which is similar to that of the Arabidopsis genome (∼42.7%; The Arabidopsis Genome Initiative 2000). Moreover, Arabidopsis and apple genomes represent sister clades within the dicot subclass Rosidae. Therefore, it is reasonable to assume that the average gene length of the apple is similar to that of Arabidopsis. Based on this assumption, the total number of apple genes is predicted to be ∼32,250, which is reasonably consistent with results obtained from analysis of our apple EST database (182,241 5′ and 3′ reads) indicating that the total gene content of apple is ∼29,000 (unpublished data).

Plant genomes contain a variety of TEs such as transposons, retrotransposons, and miniature inverted-repeat TEs (MITEs). The most abundant TEs in plant genomes are retrotransposons and MITEs (Feschotte et al. 2002). In this study, TEs are identified in ∼21% of apple BESs. Of these TEs, 88.2% are retrotransposons, thus suggesting that the apple genome contains abundant retrotransposons. The ratio of Ty3-gypsy to Ty1-copia retrotransposons in apple BESs is 1:3, which differs from those reported for the Arabidopsis (1:1; The Arabidopsis Genome Initiative 2000) and rice (2:1; International Rice Genome Sequencing Project 2005) genomes. Moreover, ∼11.6% of BESs contain unclassified TEs (Table 3), suggesting that novel repeats constitute a significant portion of the apple genome. On the other hand, MITEs have been reported to be present in high copy numbers in the apple genome (Han and Korban 2007). However, few MITEs have been identified in apple BESs. Similarly, few MITEs have been found in papaya BESs (Lai et al. 2006). The detection of MITEs in BESs may be significantly biased by either the restriction enzyme used to generate the BAC library or the secondary structures of MITEs influencing BAC end sequencing.

SSRs constitute a special class of tandemly repeated DNA. SSRs have several advantages over other molecular markers, including high polymorphism due to the high mutation rate affecting the number of repeat units, abundance throughout eukaryotic genomes, and co-dominant inheritance (Tóth et al. 2000; Katti et al. 2001). SSRs have been extensively used for genome mapping in plants such as rice (Coburn et al. 2002; McCouch et al. 2002), maize (Sharopova et al. 2002), wheat (Gao et al. 2004; La Rota et al. 2005), and papaya (Eustice et al. 2007). BESs are useful resources for the development of SSR markers, and BAC-end sequence-based SSRs have been successfully used to develop genetic maps in cotton (Frelichowski et al. 2006) and soybean (Shultz et al. 2007a). In this study, analysis of apple BESs has revealed that 6.5% of BESs contain SSRs. This suggests that the development of BES-based SSRs is a potentially feasible approach for either constructing or saturating the genetic map for apple. Moreover, the most abundant SSRs identified in apple BESs are A/T monomer and AT/TA dimer repeats. This is in agreement with previous findings indicating that AT-rich SSRs are predominant in Arabidopsis (Tamanna and Khan 2005), soybean (Shultz et al. 2007a), and papaya (Lai et al. 2006). In addition, most of the SSRs identified in apple BESs are 20−40 bp in length, and very few SSRs are larger than 50 bp in length (Fig. 1). The length distribution of apple BES-based SSRs is consistent with a previous finding that the frequency of repeats decreases exponentially with repeat length (Katti et al. 2001).

SSR analysis has been reported for expressed sequence tags (ESTs) from apple (Newcomb et al. 2006). Here, we further compare the composition of BES-based SSRs with that of EST-derived SSRs in apple. AT and AG repeats are the most abundant di-nucleotide repeats in both BES-based and EST-derived SSRs. Both BESs and ESTs have few GC repeats. However, the frequencies of different types of repeats differ between BES-based SSRs and EST-derived SSRs. For example, AT and AG repeats account for ∼57% and 18.8% of di-nucleotide repeats identified in BESs, respectively, whereas AT and AG repeats constitute 7.6% and 88% of di-nucleotide repeats derived from ESTs, respectively (Newcomb et al. 2006). The frequency of di-nucleotide repeats is higher than that of tri-nucleotide repeats for BES-based SSRs, whereas the frequencies of di- and tri-nucleotide repeats in EST-derived SSRs are comparable (Newcomb et al. 2006). These inconsistencies may be attributed to the fact that the composition and frequency of SSRs differ between genomic DNA and coding region sequences. Moreover, it is worth mentioning that the minimum length used to define SSRs differs between the two datasets. The minimum size of BES-based SSRs is 15 bp, whereas it is 12 bp for EST-derived SSRs. The differences in the minimum length of SSRs may also contribute to the observed inconsistencies in SSR distribution between BESs and ESTs.

Comparative genetic mapping studies have revealed colinear chromosome segments among closely related species within families such as Poaceae (Devos and Gale 2000), Solanaceae (Tanksley et al. 1992), and Brassicaceae (O’Neill and Bancroft 2000). However, analysis of colinear chromosome segments is not well suited for distantly related species (Paterson et al. 1996). Recently, with the completion of whole genome sequences of model plants such as Arabidopsis and rice, an alternative analysis approach, microsynteny, has been developed to investigate colinearity among distantly related species. In this study, the extent of colinearity between apple and each of the three sequenced plant species, the eudicots Populus and Arabidopsis along with the monocot rice, has been determined by mapping apple BAC end pairs to the model plant genomes. A total of 154, 107, and 106 apple BES pairs have been identified to be homologous to Populus, Arabidopsis, and rice genomes, respectively. Among these BES pairs, 8 (5.2%), 3 (2.8%), and 0 have been mapped to Populus, Arabidopsis, and rice genomes, respectively, with a span of 10 to 300 kb. The apple and Populus represent two sister orders within the Eurosids I clade, whereas Arabidopsis is a member of the order Brassicales within the Eurosids II clade. Thus, results presented in this study indicate that the apple has a higher degree of synteny with the closely related Populus than with the distantly related Arabidopsis. Therefore, in the future, comparative genetic mapping can be carried out between apple and poplar genomes using a microsynteny approach. Moreover, 28 BES pairs of apple map to the same chromosomes of Arabidopsis. Among those, 25 map to the same chromosome regions with a span of either less than 10 kb or more than 300 kb. This finding suggests that the degeneration of microsynteny between apple and Arabidopsis may be due to extensive rearrangements of the Arabidopsis genome (Blanc et al. 2000).