Introduction

Chloroplasts, plant cell organelles derived from free-living cyanobacteria (Keeling 2004), contain the entire enzymatic machinery for photosynthesis. In addition, they host several other biochemical pathways, including those responsible for the biosynthesis of fatty acids, amino acids, pigments, and vitamins. Chloroplasts contain their own small genome, which generally has a highly conserved organization. In angiosperms, the genome is circular with a quadripartite structure that includes two copies of an inverted repeat (IR), usually 20–26 kb in length, that separate one large single-copy (LSC) region from one small single-copy (SSC) region (Palmer et al. 1985). Chloroplast genomes range from 108 to 217 kb, with the vast majority in the 150–170 kb range. Most contain 110–130 distinct genes; the majority of these genes (about 80) code for proteins that are generally involved in photosynthesis or gene expression, with the remainder being transfer RNA (about 30) or ribosomal RNA (4) genes (Raubeson and Jansen 2005).

There has been a rapid increase in our understanding of chloroplast genome organization and evolution, owing to the availability of many new completely sequenced genomes. Since the first report of the complete chloroplast genome of liverwort (Ohyama et al. 1986), more than 150 complete chloroplast genomes from plants and algae have been deposited in GenBank. Comparative studies indicate that the chloroplast genomes of land plants are highly conserved in both gene order and gene content. However, in several lineages of land plants, chloroplast DNAs have undergone multiple rearrangements, including Pinus (Wakasugi et al. 1994) and the angiosperm families Campanulaceae (Cosner et al. 1997), Fabaceae (Milligan et al. 1989; Palmer et al. 1988), Geraniaceae (Palmer 1987), and Lobeliaceae (Knox and Palmer 1998).

The pace of chloroplast genome sequencing has increased markedly over the last 5 years (Jansen et al. 2005), driven largely by improvements in Sanger sequencing technology that have greatly reduced time and cost (Metzker 2005). In addition, new pyrosequencing platforms such as the Roche 454 system (Branford, CT, USA; available through Roche Diagnostics, Indianapolis, IN, USA) have been introduced in recent years and are expected to further reduce the time and cost of obtaining whole chloroplast genome sequences (Huse et al. 2007; Moore et al. 2006; Tangphatsornruang et al. 2010; Yang et al. 2010). In a single run, the 454 system (Roche 454 GS FLX Titanium) generates up to 400 million high-quality bases in hundreds of thousands of short sequence reads called flowgrams, which are then assembled into genomic contigs. For relatively small genomes, the high number of reads results in a high average depth of sequence coverage, effectively overcoming many of the limitations of pyrosequencing, such as relatively short read length and uncertainty in the length of homopolymer runs (Margulies et al. 2005; Ronaghi et al. 1998). Perhaps the greatest advantage of the 454 system is that it generates genome sequence much more rapidly and economically than traditional Sanger-based shotgun sequencing.

There are many advantages to using chloroplast DNA for taxonomy and evolutionary research: (1) it is small, has a high copy number, and has a simple structure; (2) its gene content and arrangement are more conserved than those of mitochondrial and nuclear genomes, making it easier to design primers; (3) it is maternally inherited and thus free of the genetic reassortment that obscures molecular phylogenetic relationships. In addition, information about the chloroplast genome can be used in various research areas, such as chloroplast transformation (Maliga 2002) and the development of crops with good agricultural traits (Bock and Khan 2004; Daniell et al. 2004).

Pear (Pyrus) is an economically important crop and a member of the Rosaceae family. It has been cultivated for more than 2,000 years and is among the most important fruits of the temperate regions, in about 50 countries of the world (Bell 1990). The classification of Pyrus is often very difficult because of natural or artificial interspecific hybrids, which arise easily owing to self-incompatibility, and the lack of distinguishable characters between species. To establish the phylogenetic classification of Pyrus, various comparisons have been attempted using morphological characters, phenolic compounds, and DNA markers (Aldasoro et al. 1996; Bell 1990; Challice and Westwood 1973; Rehder 1940; Rubtsov 1944; Volk et al. 2006; Yamamoto et al. 2002). Until now, few chloroplast-derived markers have been used to study evolutionary relationships among Pyrus species (Iketani et al. 1998; Katayama and Uematsu 2003; Kimura et al. 2003; Katayama et al. 2012). Additional markers are required to determine more detailed relationships within Pyrus and to improve phylogenetic classification within the Rosaceae family in general, which is currently controversial. We hope to add to the available information by assessing chloroplast genome variation using structural alterations such as indels and microsatellites. Two complete Rosaceae chloroplast genomes, those of Malus × domestica (GDR/Genome Database for Rosaceae, http://www.rosaceae.org/projects/apple_genome) and Prunus persica (Jansen et al. 2011), were reported last year. However, detailed comparisons between these chloroplast genomes have not yet been made. Comparison of the three chloroplast genomes should reveal structural features, such as highly variable regions, in the Rosaceae.

In this study, we present the complete sequence of the chloroplast genome of pear (Pyrus pyrifolia) and carry out comparative analyses with other known chloroplast genomes, especially those in Rosaceae. We used one of the next-generation sequencing methods, pyrosequencing (Roche 454 GS FLX Titanium), for complete chloroplast genome sequencing.

Materials and methods

Total DNA was isolated from fresh green leaves of the Japanese pear variety “Housui” (P. pyrifolia) using the Genomic DNA Buffer Set and QIAGEN Genomic-tip 20/G (QIAGEN). The DNA was sheared by nebulization and amplified by emulsion polymerase chain reaction (PCR), and sequencing was performed by pyrosequencing on the Genome Sequencer (GS) FLX Titanium (Roche Diagnostics; Margulies et al. 2005). Bases with a quality score below 20 were trimmed from the raw sequences. Adapter sequences were also trimmed, and cleaned sequences shorter than 150 bases were discarded using CLC Genomics Workbench ver. 3.7.1 (CLC bio, Aarhus, Denmark). A mapping assembly was generated against the reference sequence, the complete chloroplast genome of Nicotiana tabacum (Z00044), using CLC Genomics Workbench with default parameters.
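
The read-cleaning step can be illustrated with a minimal sketch. It assumes a simple end-trimming rule (bases below the quality threshold are removed from both ends of a read) followed by the 150-base length filter; this is not the algorithm implemented in CLC Genomics Workbench, and the function name `trim_read` is hypothetical.

```python
from typing import List, Optional, Tuple

MIN_QUALITY = 20   # quality threshold stated in the text
MIN_LENGTH = 150   # minimum cleaned read length stated in the text


def trim_read(seq: str, quals: List[int],
              min_quality: int = MIN_QUALITY,
              min_length: int = MIN_LENGTH) -> Optional[Tuple[str, List[int]]]:
    """Trim low-quality bases from both ends of a read.

    Returns the trimmed (sequence, qualities) pair, or None when the
    trimmed read is shorter than min_length and should be discarded.
    Assumption: simple end trimming, not the CLC trimming algorithm.
    """
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lists must have equal length")
    start, end = 0, len(seq)
    # Advance past low-quality bases at the 5' end.
    while start < end and quals[start] < min_quality:
        start += 1
    # Step back over low-quality bases at the 3' end.
    while end > start and quals[end - 1] < min_quality:
        end -= 1
    if end - start < min_length:
        return None
    return seq[start:end], quals[start:end]
```

A read that returns None would be excluded from the mapping assembly; adapter removal is a separate step that is not shown here.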

Misread and unread regions were amplified by PCR and completed by Sanger sequencing. The primers used for sequencing are listed in Electronic supplementary material (ESM) Table 1. PCR was carried out with PrimeSTAR DNA polymerase (TAKARA BIO). Sequencing reactions were performed with Big Dye Terminator v3.1 (Applied Biosystems) and run on an ABI3100 DNA sequencer (Applied Biosystems).

The genomic sequence was annotated using the program Dual Organellar GenoMe Annotator (Wyman et al. 2004). Searches against a custom database of the previously published chloroplast genomic sequences using BLASTX were used to verify the predicted annotations (Altschul et al. 1990). Intron positions and rRNA genes were determined based on those of the Prunus chloroplast genome (Jansen et al. 2011). The tRNA genes were confirmed using tRNAscan-SE (ver. 1.23) (Lowe and Eddy 1997). The circular chloroplast genome map was drawn by the GenomeVx program (Conant and Wolfe 2008).

Translocations and inversions were examined by pairwise comparisons among three genomes using PipMaker (Schwartz et al. 2000). Multiple alignments of 11 complete chloroplast genomes, namely Malus × domestica (GDR/Genome Database for Rosaceae; http://www.rosaceae.org/projects/apple_genome), P. persica (NC_014697; Jansen et al. 2011), N. tabacum (NC_001879; Shinozaki et al. 1986), Cucumis sativus (NC_007144; Kim et al. 2006), Vitis vinifera (NC_007957; Jansen et al. 2006), Arabidopsis thaliana (NC_000932; Sato et al. 1999), Citrus sinensis (NC_008334; Bausher et al. 2006), Spinacia oleracea (NC_002202; Schmitz-Linneweber et al. 2001), Zea mays (NC_001666; Maier et al. 1995), Oryza sativa (NC_001320; Hiratsuka et al. 1989), and Triticum aestivum (NC_002762; Ogihara et al. 2000), with P. pyrifolia (present study) as the reference, were performed using VISTA (Mayor et al. 2000). The numbers of nucleotide substitutions, indels, and inversions between Pyrus and Malus were tallied for 89 noncoding chloroplast DNA regions (≥100 bp). Indels, nucleotide substitutions, and inversions were scored as independent, single characters. The proportion of mutational events was calculated as (NS + ID + IV)/L × 100, where NS = the number of nucleotide substitutions, ID = the number of indels, IV = the number of inversions, and L = the aligned sequence length.
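
The proportion of mutational events can be expressed as a short function. This is a minimal sketch of the formula given above; the function name and the example counts are illustrative and are not taken from the data set.

```python
def mutational_event_percentage(substitutions: int, indels: int,
                                inversions: int, aligned_length: int) -> float:
    """Return (NS + ID + IV) / L x 100 for one noncoding region.

    Each substitution, indel, and inversion counts as a single event,
    regardless of its length, as described in the text.
    """
    if aligned_length <= 0:
        raise ValueError("aligned length must be positive")
    if min(substitutions, indels, inversions) < 0:
        raise ValueError("event counts must be non-negative")
    return (substitutions + indels + inversions) / aligned_length * 100


# Illustrative values only: 12 substitutions, 3 indels, and no inversions
# in a 500 bp alignment correspond to 15 events per 500 aligned positions.
example = mutational_event_percentage(12, 3, 0, 500)
print(f"{example:.2f}%")
```

Because every event is weighted equally, a long indel and a single substitution contribute the same amount to the score of a region.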

Phylogenetic analyses were performed on an aligned data matrix that included 35 taxa of angiosperms and 81 protein-coding genes (atpA, atpB, atpE, atpF, atpH, atpI, ccsA, cemA, clpP, infA, matK, ndhA, ndhB, ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK, petA, petB, petD, petG, petL, petN, psaA, psaB, psaC, psaI, psaJ, psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ, rrn16, rrn23, rrn45, rrn5, rpl22, rpl23, rps16, rbcL, rpl14, rpl16, rpl2, rpl20, rpl32, rpl33, rpl36, rpoA, rpoB, rpoC1, rpoC2, rps11, rps12, rps14, rps15, rps18, rps19, rps2, rps3, rps4, rps7, rps8, ycf3, ycf2, and ycf4). Amino acid sequences were aligned using the Multiple Sequence Web viewer and Alignment Tool (http://mswat.ccbb.utexas.edu) and manually adjusted. The amino acid alignment was used to constrain the nucleotide alignment. Maximum parsimony (MP) analysis was performed using PAUP version 4.0b (Swofford 2003). The MP analysis was performed with 100 random addition replicates and TBR branch swapping with the MulTrees option. Nonparametric bootstrap analyses (Felsenstein 1985) were performed for 1,000 replicates with one random addition replicate and TBR branch swapping with the MulTrees option.
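
Constraining a nucleotide alignment by an amino acid alignment can be sketched as follows. The example assumes that each aligned residue corresponds to exactly one codon of the unaligned coding sequence and that gaps in the protein alignment become three-base gaps; the function name `thread_codons` is hypothetical, and the tool used in this study may handle edge cases differently.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Project the gaps of an aligned protein onto its coding sequence.

    aligned_protein: one row of an amino acid alignment (with gaps).
    cds: the unaligned coding sequence that encodes that protein.
    Returns the codon-aligned nucleotide row.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for the protein")
    codons = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            # A gap in the protein row becomes a whole-codon gap.
            codons.append(gap * 3)
        else:
            codons.append(cds[pos:pos + 3])
            pos += 3
    return "".join(codons)


# Two toy sequences: the second lacks the middle residue of the first.
row_a = thread_codons("MAK", "ATGGCTAAA")
row_b = thread_codons("M-K", "ATGAAA")
print(row_a)
print(row_b)
```

Threading codons in this way keeps the reading frame intact, so gaps never split a codon in the nucleotide matrix.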

Results

Genome assembly and validation

Pyrus genomic DNA was sequenced using 454 Life Sciences technology on the GS FLX system. A total of 2,654,964 reads were generated with an average length of 375 bases, covering 995 Mb. These reads were cleaned, and the remaining reads (2,254,379 reads with an average length of 410 bases) were assembled with reference to the N. tabacum chloroplast genome. The average sequencing depth of each nucleotide in the Pyrus chloroplast genome was 178×, and 60,313 chloroplast-related reads (2.68%) were collected with an average length of 426 bases, covering 25 Mb. There were 151 gap (mis-assembled) regions and 28 uncertain homopolymers in the assembled sequences.

Size, gene content, and organization of the Pyrus chloroplast genome

The complete chloroplast genome of Pyrus (GenBank/EMBL/DDBJ accession number: AP012207) was 159,922 bp in length (Fig. 1) and included a pair of IRs of 26,392 bp each, separated by one small and one large single-copy region (SSC and LSC) of 19,237 and 87,901 bp, respectively. The Pyrus chloroplast genome contained 113 unique genes, 17 of which were duplicated in the IR, giving a total of 130 genes (Table 1). There were four rRNA and 30 tRNA genes; seven tRNA genes and all the rRNA genes were duplicated in the IR. Eighteen genes contained one or two introns, and six of these were tRNA genes. The numbers and kinds of tRNA genes in the Pyrus chloroplast genome were identical to those of well-characterized vascular plants. The genome consisted of 55.48% coding regions and 44.52% noncoding regions, including both intergenic spacers and introns. The overall GC and AT contents of the Pyrus chloroplast genome were 36.58% and 63.42%, respectively.
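
The quadripartite arithmetic and the base composition can be checked with a short sketch. The region sizes are those reported above; the function names are hypothetical, and the GC calculation assumes that only unambiguous bases (A, C, G, T) are counted.

```python
def quadripartite_length(lsc: int, ssc: int, ir: int) -> int:
    """Total genome length: LSC + SSC + two copies of the IR."""
    return lsc + ssc + 2 * ir


def gc_percent(seq: str) -> float:
    """GC content as a percentage of unambiguous bases."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence contains no unambiguous bases")
    gc = sum(1 for base in counted if base in "GC")
    return gc / len(counted) * 100


# Region sizes reported for the Pyrus chloroplast genome (bp).
total = quadripartite_length(lsc=87_901, ssc=19_237, ir=26_392)
print(total)  # 159922
```

Applied to the full genome sequence, `gc_percent` would return the GC content, and the AT content follows as 100 minus that value.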

Fig. 1

Gene map of the P. pyrifolia chloroplast genome. The thick line indicates the extent of the IRs (IRa and IRb) which separate the genome into SSC and LSC regions. Genes on the outside of the map are transcribed in the clockwise direction and genes on the inside of the map are transcribed in the counterclockwise direction. Genes containing introns and pseudogenes are marked with * and **, respectively

Table 1 Genes located in the Pyrus chloroplast genome

Comparison of the whole chloroplast genome among angiosperms

Dot plot analysis showed that gene order and organization in Pyrus were similar to those in Malus and Nicotiana (Fig. 2). Neither translocations nor inversions were detected among the three plant species. Global alignments using VISTA showed Malus to be the closest relative of Pyrus (Fig. 3). The identity plots revealed considerable differences between Pyrus and the monocots, such as large inversions. Length mutations, here indels of 10 bp or more, were identified by comparing the sequences of Pyrus, Malus, and Prunus (Table 2). With the exception of ycf1 and ycf2, which had deletions in the coding region, all the indels were detected in spacer or intron regions. The indels of ycf1, ycf2, trnL-ndhB, and trnR-trnN were located within the IR region. In total, 46 indels (23 insertions and 23 deletions) were detected when comparing the chloroplast genomes of Pyrus and Malus. In the comparison between Pyrus and Prunus, there were a total of 174 indels (69 insertions and 105 deletions). Three insertions and 13 deletions were detected in Pyrus relative to the same loci in both Malus and Prunus. A comparison of the mutational events within 89 noncoding chloroplast DNA regions in Pyrus and Malus revealed that the intergenic spacer ndhC-trnV was the most variable, with a percentage variability of 6.07% (Fig. 4). The average percentage variability was 1.10%. In the intergenic spacer ndhC-trnV, 41 mutational events were detected; 32 were nucleotide substitutions and nine were length mutations, two of which were indels with tandem repeats.
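
Scoring indels from a pairwise alignment can be sketched as below. The example assumes two gapped rows of equal length and treats a gap run in the query as a deletion and a gap run in the reference as an insertion; this polarity is only relative to the chosen reference, and the function names are hypothetical rather than part of the tools used in this study.

```python
import re
from typing import Dict, List, Tuple

GapRun = Tuple[int, int]  # (start column, length)


def long_gap_runs(aligned: str, min_len: int = 10, gap: str = "-") -> List[GapRun]:
    """Return every run of at least min_len consecutive gap characters."""
    pattern = re.compile(re.escape(gap) + "{%d,}" % min_len)
    return [(m.start(), m.end() - m.start()) for m in pattern.finditer(aligned)]


def classify_indels(query_aln: str, ref_aln: str,
                    min_len: int = 10) -> Dict[str, List[GapRun]]:
    """Classify long gap runs as insertions or deletions in the query."""
    if len(query_aln) != len(ref_aln):
        raise ValueError("aligned rows must have equal length")
    return {
        # Gaps in the reference row: sequence present only in the query.
        "insertions": long_gap_runs(ref_aln, min_len),
        # Gaps in the query row: sequence present only in the reference.
        "deletions": long_gap_runs(query_aln, min_len),
    }


query = "ACGT" + "-" * 12 + "ACGT" + "A" * 10
ref = "ACGT" + "T" * 12 + "ACGT" + "-" * 10
print(classify_indels(query, ref))
```

Whether an event is a true insertion or deletion in evolutionary terms cannot be decided from two sequences alone; a third genome, as in the three-genus comparison above, is needed for that.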

Fig. 2

Dot plot. A P. pyrifolia versus Malus × domestica. Numbers along the x-axis indicate the coordinates for Pyrus and along the y-axis for Malus. B P. pyrifolia versus N. tabacum

Fig. 3

Identities of several dicot and monocot species. VISTA plot (global alignment) comparison of pear chloroplast genome with 11 chloroplast genomes. Y-scale represents the percent identity ranging from 50% to 100%. Genomes are arranged according to the number of conserved bases with respect to Pyrus

Table 2 Loci of indels (≥10 bp) found in chloroplast genomes among three genera in Rosaceae

Fig. 4

The proportion of mutational events between Pyrus and Malus within 89 noncoding chloroplast DNA regions. The proportion of mutational events = (NS + ID + IV)/L × 100. NS = the number of nucleotide substitutions, ID = the number of indels, IV = the number of inversions, and L = the aligned sequence length

IR expansion/contraction

Figure 5 shows the detailed IR–SC border positions with respect to the adjacent genes in Pyrus, Malus, Prunus, Arabidopsis, and Nicotiana. In Pyrus, the IRa/SSC border (position 133,432) was located in the 3′ region of the ycf1 gene, which created a ycf1 pseudogene of 1,073 bp at the IRb/SSC border (position 114,194). A similar structure was also observed in Malus, Prunus, Arabidopsis, and Nicotiana. In Pyrus, the IRa/LSC border (position 159,823) was located downstream of the noncoding region of trnH-GUG, while the IRb/LSC border (position 87,803) was located within the coding region of rps19. Thus, a 3′-truncated rps19 pseudogene of 120 bp (position 159,703–159,823) was created at the other border (IRa/LSC). These features were similar in Malus, Prunus, and Arabidopsis, but in Nicotiana there was no rps19 pseudogene at the other border (Fig. 5). Figure 6 compares the sequences of the IR/SC junction regions of Pyrus, Malus, and Prunus. In Prunus, the IR extended 62 bp further at the IR/LSC borders than in Pyrus. In addition, there were length mutations between the IRa/LSC junction and trnH, but there were fewer length mutations at the IR/SSC borders. At the IR/SSC junctions, the IR of Pyrus and Malus was only 2 bp longer than that of Prunus.

Fig. 5

Detailed view of the inverted repeat-single copy (IR/SC) border regions among five chloroplast genomes. Annotated genes or portions of genes are indicated by gray boxes above or below the genome

Fig. 6

Alignment of the IRb/SSC border region in the genera Pyrus, Malus, and Prunus. Underlined sequences are within the IR

Simple sequence repeats

Chloroplast simple sequence repeats (SSRs), defined here as stretches of 10 or more repeated motifs, were investigated. Sixty-seven SSR loci were identified in the Pyrus chloroplast genome (Table 3). Thirty-one A stretches (10–22 bases), 34 T stretches (10–24 bases), and two C stretches (10 bases) were located within the genome, but there were no G stretches. Forty-eight of the 67 SSR loci were detected in the intergenic spacers and were composed of A or T stretches. Fourteen of the 67 SSR loci were found within intron regions, but only four were located in gene coding regions.
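
A search for mononucleotide SSRs of this kind can be sketched with a regular expression. The example assumes that only single-base repeats of at least 10 bases are of interest, in line with the stretches reported above; the function name `find_mono_ssrs` and the toy sequence are illustrative.

```python
import re
from collections import Counter
from typing import List, Tuple


def find_mono_ssrs(seq: str, min_repeats: int = 10) -> List[Tuple[int, str, int]]:
    """Return (start, base, length) for each mononucleotide run.

    Only runs of at least min_repeats identical bases are reported.
    Start positions are 0-based.
    """
    pattern = re.compile("|".join(f"{base}{{{min_repeats},}}" for base in "ACGT"))
    return [(m.start(), m.group()[0], len(m.group()))
            for m in pattern.finditer(seq.upper())]


# Toy sequence: a 12-base A run, a 10-base T run, and a 9-base C run
# that falls below the threshold and is therefore not reported.
toy = "GC" + "A" * 12 + "GCGC" + "T" * 10 + "G" + "C" * 9
hits = find_mono_ssrs(toy)
print(hits)
print(Counter(base for _, base, _ in hits))
```

Combining the reported positions with the gene annotation would allow each locus to be assigned to an intergenic spacer, an intron, or a coding region.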

Table 3 Distribution of simple sequence repeat (SSR) loci in the Pyrus chloroplast genome

Phylogenetic analysis

Phylogenetic analyses were performed on an aligned data matrix that included 35 taxa of angiosperms and 81 protein-coding genes. The total length of the aligned data set was 62,384 nucleotides. MP analyses resulted in a single, fully resolved tree with a length of 102,898, a consistency index of 0.48, and a retention index of 0.49 (Fig. 7). Bootstrap analyses indicated that 30 of the 32 nodes were supported by values ≥95%. The remaining two nodes had bootstrap values of 90%. The MP trees were largely congruent with each other and with recent phylogenetic analyses based on complete chloroplast genomes (Jansen et al. 2007). MP analyses provided strong support for the monophyly of Eurosid I and for the placement of Pyrus and Prunus in that clade as sister to Morus.

Fig. 7

Phylogenetic relationships of 35 taxa based on 81 chloroplast genes using maximum parsimony (MP). Bootstrap values (1,000 replications) are shown at the nodes

Discussion

The sequence of the complete chloroplast genome of Pyrus was determined using pyrosequencing (Roche 454 GS FLX Titanium) and assembled with the chloroplast genome of N. tabacum as the reference sequence. Pyrosequencing has recently been used to determine entire chloroplast sequences, for example, that of Phoenix (Yang et al. 2010). When we started this research, the complete chloroplast genome sequences of Malus × domestica and P. persica had not yet been released. As a result, there were 151 gap regions in the assembled sequences. Those regions were completed by Sanger sequencing using 99 primer pairs (ESM Table 1). Had we assembled against the chloroplast genome of another rosaceous plant, such as Malus or Prunus, the number of gaps might have been reduced.

Homopolymers are stretches of a single repeated nucleotide, and their contribution to technical sequencing errors when using 454 systems is well documented (Moore et al. 2006; Huse et al. 2007). The Pyrus chloroplast genome contained 28 uncertain homopolymers in the assembled sequence. This type of error is intrinsic to pyrosequencing and cannot be resolved by increasing coverage.

The organization of the Pyrus chloroplast genome, with two copies of an IR separating the SSC and LSC regions, was identical to that of well-known angiosperm chloroplast genomes (Palmer 1991). The genome size, 159,922 bp, was also within the known size range for angiosperms. Gene order in the Pyrus genome was identical to that in N. tabacum. The gene content of Pyrus was also very similar to that of most other angiosperm chloroplast genomes (Raubeson and Jansen 2005). Chloroplast DNAs can be classified into three groups: chloroplast DNAs lacking IRs (group I), chloroplast DNAs containing IRs (group II), and chloroplast DNAs with tandem repeats (group III) (Sugiura 1992). Almost all algae and higher plants belong to group II. In this group, the N. tabacum genome is the most representative of land plants and probably reflects the ancestral gene order among higher plants. Therefore, the Pyrus chloroplast genome can be considered a standard type among higher plants.

A total of 220 indels of 10 bp or more were found in Pyrus by comparing Pyrus, Malus, and Prunus (Table 2). Most of these indels may have originated from slipped-strand mispairing of surrounding sequences (Levinson and Gutman 1987) or illegitimate recombination events (Milligan et al. 1989; Ogihara et al. 1988; Shimada and Sugiura 1989). Only five indels were seen in the IR region. Sequences in the IR regions have been reported to diverge at slower rates than sequences in the LSC and SSC regions (Kim and Lee 2004). The results of the present study are consistent with this observation and support the idea that the stabilizing effect of genetic recombination between the IR regions is the main cause of their sequence conservation.

The largest indel in Pyrus was a 229 bp deletion (compared with Malus) in the accD-psaI intergenic region; it has also been detected by RFLP analysis based on physical mapping and by DNA sequencing (Katayama and Uematsu 2003; Katayama et al. 2012). The intergenic region between rbcL and cemA, which includes accD-psaI, has previously been reported as a hypervariable region and might represent an intramolecular recombinational hot spot mediated by short direct repeats in the chloroplast DNA of Poaceae and Rosaceae (Ogihara et al. 1988; Shaw et al. 2007).

The indels that we have discovered might have numerous important applications in systematics and evolutionary biology, such as elucidating the origin of domesticated species (Wills and Burk 2006), tracing biogeographic movements (Ickert-Bond and Wen 2006; Schonswetter et al. 2006a, b), and clarifying complex relationships among species (Shaw and Small 2005). Sixteen indels unique to Pyrus were detected by comparing the chloroplast genomes of the Rosaceae genera Pyrus, Malus, and Prunus (Table 2). It is highly likely that these indels will be very useful as intraspecific DNA markers in Pyrus.

By comparing 89 noncoding chloroplast DNA regions in Pyrus and Malus, we identified highly variable regions such as the intergenic spacers ndhC-trnV and trnR-atpA (Fig. 4). Previously, Shaw et al. (2007) reported 13 highly variable regions in the chloroplast genomes of seven angiosperms including Prunus; the intergenic spacer ndhC-trnV was one of these. However, the intergenic spacer trnR-atpA was not among the regions they reported. The aligned sequence lengths of ndhC-trnV and trnR-atpA were 674 and 887 bp, respectively (data not shown). These highly variable intergenic spacers will be useful as intraspecific DNA markers. Katayama and Uematsu (2005), using physical mapping, reported a hypervariable region located within a stretch of about 9.1 kb that includes the intergenic spacer psbA-atpA at the left border of the LSC of the Prunus chloroplast genome. In the present study, the same trend was apparent within a region of about 10.1 kb that includes the intergenic spacer trnH-atpA (position 77–10,979) in the comparison of the Pyrus and Malus chloroplast genomes (Fig. 4). This highly variable region may be specific to Rosaceae.

The IR/LSC and IR/SSC borders are among the highly variable regions of the chloroplast genome and usually differ between species, even between closely related genera of the same family (Kim and Lee 2004). Expansions and contractions of IR regions often create large length variations between chloroplast genomes (Palmer et al. 1988; Raubeson and Jansen 2005; Wakasugi et al. 1994). The expansions and contractions of the IR are probably mediated by gene conversion and recombinational repair of double-strand breaks (Goulding et al. 1996). In the present study, detailed comparisons of the IR boundaries in five genera highlighted the wide range of expansions and contractions possible in the IR (Fig. 5). In the comparison between Pyrus, Malus, and Prunus, the IR/LSC borders were more variable than the IR/SSC borders. In Prunus, the IR/LSC borders were 62 bp longer than those of Pyrus. In Malus, the length between the IRa/LSC junction and trnH was 69 bp shorter than in Pyrus (Fig. 6). It may be possible to use these differences directly as DNA markers, which would be a useful evolutionary tool at both the intra- and interspecific levels.

A total of 67 SSR loci were identified in the Pyrus chloroplast genome. To date, chloroplast SSRs have been detected in Pinus radiata (Cato and Richardson 1996; Powell et al. 1995), O. sativa (Ishii et al. 2001), Panax schinseng (Kim and Lee 2004), C. sativus (Kim et al. 2006), and Vigna radiata (Tangphatsornruang et al. 2010). These chloroplast microsatellites can be useful in ecological and evolutionary studies because they are highly variable at the inter- and intrapopulation levels. Pyrus chloroplast microsatellites should therefore be useful tools as well.

There has been a rapid increase in the number of studies using DNA sequences from completely sequenced chloroplast genomes for estimating phylogenetic relationships among angiosperms (Bausher et al. 2006; Goremykin et al. 2005; Jansen et al. 2006, 2007; Leebens-Mack et al. 2005; Ravi et al. 2006; Ruhlman et al. 2006). The phylogenetic analyses reported here, with expanded taxon sampling, were consistent with these earlier studies, so our discussion will focus on relationships among Rosales (Fig. 7). The phylogenetic trees in this study indicate a close relationship between Pyrus and Prunus with high bootstrap support (100%). These results agree with phylogenetic analyses based on matK and trnL-trnF (Potter et al. 2002). The whole genome sequence of Malus, a member of Pyreae, could not be used for phylogenetic analysis in this study because the updated online sequence of Malus had not yet been annotated. Additional chloroplast genome sequences from groups such as Spiraeae and Rosoideae are needed to generate a Rosaceae phylogeny based on whole genomes.