Introduction

The monocot order Poales comprises 16 families and approximately 18,000 species. Relationships among families are generally well resolved and supported (Guisinger et al. 2010). The largest family within the Poales, the Poaceae, has provided the basis for many studies, including complete chloroplast genome sequencing, owing to its ecological, economic, and evolutionary importance (Guisinger et al. 2010).

Cultivated pineapple (Ananas comosus (L.) Merr.) belongs to the family Bromeliaceae in the Poales. Pineapple is the third most important tropical fruit in world production after banana and citrus (Rohrbach et al. 2002). It has been cultivated for more than 500 years in the Americas. Domesticated pineapple was already widely distributed in the Americas and the Caribbean prior to the arrival of Columbus in 1493 (Rohrbach et al. 2002). Pineapple has the crassulacean acid metabolism (CAM) photosynthetic pathway (Malézieux et al. 2002). CAM plants conserve water by conducting most of their gas exchange in the relatively cool atmosphere at night, and they can grow in strongly water-limited semidesert habitats (West-Eberhard et al. 2011). Such strong drought tolerance also enables months-long storage of vegetative propagules of pineapple (Hepton 2002). This drought tolerance and the ease of transporting long-lived vegetative propagules facilitated the wide diffusion of pineapple throughout the tropics. Ananas includes two major species, A. macrodontes and A. comosus. The latter has five botanical varieties: A. comosus var. bracteatus, var. parguazensis, var. comosus, var. ananassoides, and var. erectifolius (Coppens d’Eeckenbrugge et al. 2002). Only var. comosus has edible cultivars. To clarify the phylogeny of Ananas, analyses of morphological characters and DNA markers have been performed (Coppens d’Eeckenbrugge et al. 1997; Duval et al. 2001, 2003; Paz et al. 2005, 2012; Hamdan et al. 2013). Mexican and Cuban pineapples were characterized by using an amplified-fragment-length polymorphism method (Paz et al. 2005, 2012). Phylogenetic analysis of Malaysian cultivars was performed using the chloroplast-encoded rbcL sequence (Hamdan et al. 2013). Both restriction-fragment-length polymorphism markers and chloroplast genotypes were used to study genetic diversity in Ananas (Duval et al. 2001, 2003). So far, however, few chloroplast-derived markers have been used to study evolutionary relationships of Ananas (Hamdan et al. 2013).

In this study, we determined the complete nucleotide sequence of the chloroplast genome of A. comosus var. comosus, using next-generation sequencers. We compared it with other sequenced chloroplast genomes and discuss structural differences in the form of indels and microsatellites.

Materials and methods

Plant materials and DNA extraction

The pineapple cultivar ‘N67-10’, grown at the Nago Branch of the Okinawa Prefectural Agricultural Research Center (Nago, Okinawa, Japan), was used. Unexpanded young leaves in the crown were collected and used for DNA extraction. Total DNA was extracted with a DNeasy Plant Mini Kit (Qiagen, Hilden, Germany) according to the manufacturer’s instructions.

DNA sequencing

For the 454 GS FLX+ genome sequencer (Roche Diagnostics, Basel, Switzerland), the total genomic DNA of ‘N67-10’ was sheared by nebulization (600–900 bp in length). A rapid library was prepared from the sheared DNA fragments with a GS FLX Titanium Rapid Library Preparation Kit (Roche). The library was then clonally amplified by emulsion PCR with a GS FLX Titanium LV emPCR Kit (Roche). Purified beads carrying the amplified library were sequenced on the 454 GS FLX+ genome sequencer. Two runs of single-read pyrosequencing were performed. For the HiSeq 2500 sequencer (Illumina, San Diego, CA, USA), the total genomic DNA of ‘N67-10’ was fragmented to 350 bp using a Covaris M220 (Covaris, Woburn, MA, USA), and a paired-end library was prepared with a TruSeq DNA LT Sample Prep Kit (Illumina). The library was sequenced on the HiSeq 2500. The paired-end read length was 100 bp. All experiments were performed according to the manufacturers’ instructions.

De novo assembly of 454 GS FLX+ data

The 454 GS FLX+ sequence reads were assembled in the CLC Genomics Workbench 7.0 de novo assembly program (Qiagen). We sorted the assembled contiguous sequences (contigs) by depth of coverage to distinguish the chloroplast genome (>50×) from the mitochondrial and nuclear genomes (<50×). We confirmed that the high-coverage (>50×) contigs were derived from the chloroplast genome by BLAST search against the nucleotide collection database (nr/nt) of the National Center for Biotechnology Information (NCBI, http://www.ncbi.nlm.nih.gov/). In addition, we confirmed that, among the registered chloroplast genomes, that of Typha latifolia (GI: 289065068) showed the highest similarity to that of pineapple. We mapped the >50× contig sequences against the T. latifolia sequence by BLASTN. Fourteen contigs were mapped on the T. latifolia chloroplast genome, and gaps were filled in with sequence reads that had at least 50 bp of continuous perfect match at both ends.
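The coverage-based sorting described above can be illustrated with a minimal Python sketch. It is not the CLC Genomics Workbench procedure itself; the `Contig` record, the function name, and the example values are hypothetical, and placing contigs with exactly 50× coverage in the low group is an assumption, since the text specifies only >50× and <50×.

```python
from typing import NamedTuple


class Contig(NamedTuple):
    name: str
    length: int       # contig length in bp
    coverage: float   # mean depth of coverage


def split_by_coverage(contigs, threshold=50.0):
    """Sort contigs by coverage (highest first) and split them at the threshold.

    Returns (high, low): `high` holds putative chloroplast contigs
    (coverage > threshold); `low` holds the rest (assumed mitochondrial
    or nuclear).
    """
    ranked = sorted(contigs, key=lambda c: c.coverage, reverse=True)
    high = [c for c in ranked if c.coverage > threshold]
    low = [c for c in ranked if c.coverage <= threshold]
    return high, low


# Hypothetical contigs, for illustration only.
example_contigs = [
    Contig("contig_1", 42000, 310.5),
    Contig("contig_2", 9800, 12.3),
    Contig("contig_3", 15500, 275.0),
]
high, low = split_by_coverage(example_contigs)
```

In practice the high-coverage group would still need the BLAST confirmation described in the text, because coverage alone cannot rule out a highly repeated nuclear sequence.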

Read-mapping and correction of draft genome sequences

After the circular draft genome was assembled, we mapped the Illumina reads to it in CLC Genomics Workbench to find and correct ambiguous nucleotides. The HiSeq 2500 reads were mapped with the parameters length fraction = 1.00 and similarity fraction = 1.00. Low-coverage sites were assumed to be errors, and these misassembled sites were manually corrected against the HiSeq 2500 sequence reads. After all of the ambiguous nucleotides were corrected, encoded genes were annotated by using DOGMA (Dual Organellar GenoMe Annotator, http://dogma.ccbb.utexas.edu; Wyman et al. 2004). The circular map of the chloroplast genome was drawn with the GenomeVx program (Conant and Wolfe 2008).
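Because reads were mapped only when they matched perfectly over their full length, a site that disagrees with the reads shows up as a drop in coverage. The following Python sketch shows one way such sites could be listed from a per-base depth profile. The function names and the `min_depth` cutoff are hypothetical; the text does not state the depth below which a site was treated as an error.

```python
def low_coverage_sites(depths, min_depth):
    """Return 1-based positions whose read depth is below min_depth."""
    return [i + 1 for i, d in enumerate(depths) if d < min_depth]


def merge_into_intervals(positions):
    """Merge sorted 1-based positions into inclusive (start, end) intervals."""
    intervals = []
    for p in positions:
        if intervals and p == intervals[-1][1] + 1:
            # Position is adjacent to the last interval: extend it.
            intervals[-1] = (intervals[-1][0], p)
        else:
            intervals.append((p, p))
    return intervals


# Hypothetical depth profile, for illustration only.
example_depths = [30, 31, 0, 0, 29, 2, 30]
candidate_sites = low_coverage_sites(example_depths, min_depth=5)
candidate_regions = merge_into_intervals(candidate_sites)
```

Each reported interval would then be inspected by hand against the reads, as was done for the draft genome.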

Comparative analysis of organelle genomes

The chloroplast genome sequences of T. latifolia and Musa acuminata (GI: 525312436) were compared with that of A. comosus. Dot-plot analysis was performed in PipMaker (Schwartz et al. 2000) with default settings. The lengths of indels were assessed by aligning the complete chloroplast genome sequences in CLC Genomics Workbench.
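As an illustration of how indel lengths can be read from a pairwise alignment, the Python sketch below counts runs of gap characters in two aligned sequences. It is only a sketch under stated assumptions: the function name and the "-" gap symbol are hypothetical, and it does not reproduce the CLC Genomics Workbench alignment itself.

```python
def indel_lengths(ref_aln, qry_aln, gap="-"):
    """Return (insertions, deletions) in the query relative to the reference.

    Both arguments are rows of the same pairwise alignment. A run of gaps
    in the reference is an insertion in the query; a run of gaps in the
    query is a deletion. Each list holds the run lengths in alignment order.
    """
    if len(ref_aln) != len(qry_aln):
        raise ValueError("aligned sequences must have equal length")
    runs = {"ins": [], "del": []}
    kind, length = None, 0
    for r, q in zip(ref_aln, qry_aln):
        if r == gap and q != gap:
            col = "ins"
        elif q == gap and r != gap:
            col = "del"
        else:
            col = None
        if col == kind:
            length += 1
        else:
            # The run type changed: record the finished run, start a new one.
            if kind is not None:
                runs[kind].append(length)
            kind, length = col, 1
    if kind is not None:
        runs[kind].append(length)
    return runs["ins"], runs["del"]


# Toy alignment, for illustration only.
insertions, deletions = indel_lengths("ACGT--ACGTACGT", "ACGTTTAC---CGT")
```

Size classes such as those reported in the Results can then be obtained by filtering, for example `[n for n in deletions if n > 200]`.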

Phylogenetic analyses

Phylogenetic analyses were performed on an aligned data matrix of 62 angiosperm taxa and 76 protein-coding genes (atpA, atpB, atpE, atpF, atpH, atpI, ccsA, cemA, clpP, infA, matK, ndhA, ndhB, ndhC, ndhD, ndhE, ndhF, ndhG, ndhH, ndhI, ndhJ, ndhK, petA, petB, petD, petG, petL, petN, psaA, psaB, psaC, psaI, psaJ, psbA, psbB, psbC, psbD, psbE, psbF, psbH, psbI, psbJ, psbK, psbL, psbM, psbN, psbT, psbZ, rbcL, rpl14, rpl16, rpl2, rpl20, rpl22, rpl23, rpl32, rpl33, rpl36, rpoA, rpoB, rpoC1, rpoC2, rps11, rps12, rps14, rps15, rps16, rps18, rps19, rps2, rps3, rps4, rps7, rps8, ycf3, ycf4). Amino acid sequences were aligned by using the Multiple Sequence Web Viewer and Alignment Tool (http://mswat.ccbb.utexas.edu), manually adjusted, and then manually concatenated. The best-scoring maximum likelihood tree was constructed from the sequences in RAxML ver. 8.0.19 software with the PROTCATWAG model (Stamatakis 2014). The likelihood bootstrap probability of each branch was calculated in the “rapid bootstrap” algorithm of RAxML using 1000 replicates.
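The concatenation step was done manually in this study, but the operation itself is simple to express in code. The Python sketch below joins per-gene amino acid alignments into one data matrix, taxon by taxon. The function name, the nested-dictionary layout, and the padding of absent genes with gaps are assumptions made for illustration; in the matrix described above all 76 genes were present in every taxon, so no padding would occur.

```python
def concatenate_alignments(gene_alignments, missing="-"):
    """Concatenate per-gene alignments into one sequence per taxon.

    gene_alignments maps gene name -> {taxon: aligned sequence}.
    Genes are joined in alphabetical order. A taxon lacking a gene is
    padded with `missing` characters over that gene's alignment width.
    """
    taxa = sorted({t for aln in gene_alignments.values() for t in aln})
    parts = {t: [] for t in taxa}
    for gene in sorted(gene_alignments):
        aln = gene_alignments[gene]
        widths = {len(seq) for seq in aln.values()}
        if len(widths) != 1:
            raise ValueError(f"sequences for {gene} are not aligned to one length")
        width = widths.pop()
        for t in taxa:
            parts[t].append(aln.get(t, missing * width))
    return {t: "".join(p) for t, p in parts.items()}


# Toy alignments, for illustration only.
toy = {
    "atpA": {"Ananas": "MV-T", "Typha": "MVAT"},
    "rbcL": {"Ananas": "MSP", "Typha": "MSP", "Musa": "MSQ"},
}
matrix = concatenate_alignments(toy)
```

The resulting matrix could be written out in PHYLIP or FASTA format as input for a maximum likelihood program such as RAxML.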

Simple sequence repeats

Simple sequence repeats (SSRs) were searched for by using the microsatellite search tool MISA (http://pgrc.ipk-gatersleben.de/misa/misa.html).
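For readers unfamiliar with SSR searches, the Python sketch below finds mononucleotide and dinucleotide repeats with regular expressions. It is not MISA and does not reproduce its configuration; the function name and the thresholds (at least 10 mononucleotide repeats and at least 5 dinucleotide repeats, that is, at least 10 repeated nucleotides) are assumptions chosen for illustration.

```python
import re


def find_ssrs(seq, min_mono=10, min_di=5):
    """Return sorted (start, end, motif, repeats) tuples, 1-based inclusive.

    Mononucleotide SSRs have at least min_mono copies of one base;
    dinucleotide SSRs have at least min_di copies of a two-base motif
    whose bases differ.
    """
    seq = seq.upper()
    hits = []
    # One base followed by at least (min_mono - 1) copies of itself.
    mono = re.compile(r"([ACGT])\1{%d,}" % (min_mono - 1))
    for m in mono.finditer(seq):
        hits.append((m.start() + 1, m.end(), m.group(1), len(m.group(0))))
    # Two different bases followed by at least (min_di - 1) copies of the pair.
    di = re.compile(r"(([ACGT])(?!\2)[ACGT])\1{%d,}" % (min_di - 1))
    for m in di.finditer(seq):
        hits.append((m.start() + 1, m.end(), m.group(1), len(m.group(0)) // 2))
    return sorted(hits)


# Toy sequence, for illustration only: A x 12 and (AT) x 6.
toy_seq = "GGC" + "A" * 12 + "CGC" + "AT" * 6 + "GG"
ssrs = find_ssrs(toy_seq)
```

Positions returned this way can be intersected with the gene annotation to classify each SSR as intergenic, intronic, or coding.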

Results

Genome assembly and validation

Sequencing on the 454 GS FLX+ system generated a total of 1,340,605 reads with an average length of 568 bases, totaling 761 Mb. After cleaning and trimming, the remaining reads (1,163,292 reads with an average length of 466 bases) were assembled. Fourteen of the resulting contigs were mapped on the Typha chloroplast genome, and gaps were filled with sequence reads. Mapping the HiSeq 2500 reads onto the resultant supercontig and treating low-coverage sites as sequence errors detected 85 errors, which were corrected.

Size and gene content of the Ananas chloroplast genome

The total length of the constructed Ananas chloroplast genome was 159,636 bp and included a large single-copy (LSC) region of 87,466 bp, a small single-copy (SSC) region of 18,622 bp, and a pair of inverted repeats (IRa and IRb) of 26,774 bp each (Fig. 1). The genome contained 113 unique genes, 19 of which were duplicated in the IRs, giving a total of 132 genes (Table 1). Among 4 rRNA and 30 tRNA genes identified, all 4 rRNA and 8 tRNA genes were duplicated in the IR. The tRNA genes were identical to those of well characterized vascular plants. The genome consisted of 59.80 % coding regions and 40.20 % noncoding regions, including both intergenic spacers and introns. It had a GC content of 37.37 % and an AT content of 62.63 %.
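The figures above are internally consistent, which can be checked with a few lines of Python. The sketch below recomputes the total length from the region sizes reported in the text and shows how GC content would be computed from a sequence; the function names are hypothetical, and the example sequence is a toy string, not the Ananas genome.

```python
def total_length(lsc, ssc, ir):
    """Total genome length in bp: LSC + SSC + two copies of the IR."""
    return lsc + ssc + 2 * ir


def gc_content(seq):
    """GC content (%) over unambiguous bases (A, C, G, T) only."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return 100.0 * gc / (gc + at)


# Region sizes as reported in the text (bp).
genome_length = total_length(lsc=87466, ssc=18622, ir=26774)

# Toy sequence, for illustration only.
toy_gc = gc_content("ATGCATAT")
```

The recomputed total, 87,466 + 18,622 + 2 × 26,774 = 159,636 bp, matches the reported genome length.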

Fig. 1 Gene map of the Ananas comosus chloroplast genome. The thick line indicates the inverted repeats (IRa and IRb), which separate the genome into small single-copy (SSC) and large single-copy (LSC) regions. Genes on the outside of the map are transcribed in the clockwise direction and genes on the inside in the counterclockwise direction. Genes containing introns are marked with an asterisk

Table 1 Genes in the chloroplast DNA of Ananas comosus

Simple sequence repeats

We identified 65 SSR regions with ≥10 repeated nucleotides (Table 2): 9 with dinucleotide repeat motifs of AT or TA, 23 A stretches (10–17 bases), and 33 T stretches (10–16 bases), but no C or G stretches. Of the 65 SSR regions, 46 were in intergenic spacers, 12 within introns, and 7 in gene-coding regions.

Table 2 Distribution of simple sequence repeat (SSR) loci in the Ananas comosus chloroplast genome

Phylogenetic analysis

Phylogenetic analyses were performed on an aligned data matrix of 62 angiosperm taxa and 76 protein-coding genes with a total aligned length of 20,038 amino acids. Bootstrap analysis indicated that 41 of the 57 nodes were strongly supported (bootstrap values ≥95 %; Fig. 2). The results were in good accordance with previously reported relationships among the major groups of vascular plants (Jansen et al. 2007). The analysis suggests that Ananas (Poales) has a close phylogenetic relationship to Typha (Poales) and Musa (Zingiberales).

Fig. 2 Phylogenetic tree inferred by RAxML using amino acid sequences of 76 protein-coding genes shared between 62 angiosperm chloroplast genomes. Numbers above nodes indicate bootstrap values

Comparison of the Ananas chloroplast genome with those of Typha and Musa

Dot-plot analysis showed similar gene order and organization in Ananas and Typha (Fig. 3a). It revealed three insertions and four deletions of >200 bp in Ananas (shown as breakpoints in Fig. 3a), all in intergenic spacer regions. Coding regions had only two deletions of >10 bp: 15 bp in accD and 24 bp in ycf2 (Table 3). No missense indels between Typha and Ananas were found.

Fig. 3 Dot-plots. a Ananas comosus versus Typha latifolia. b A. comosus versus Musa acuminata. Dot-plots show indels of >200 bp in Ananas compared with Typha or Musa

Table 3 Indels within coding regions

Compared with Musa, however, there were four insertions and six deletions of >200 bp in Ananas (Fig. 3b). Three indels in Ananas were common to the comparisons with both Typha and Musa: insertions in rpoB–trnC_GCA and ndhF–rpl32 and a deletion in trnE_UUC–trnT_GGU. A large deletion of 7807 bp appears as a large disconnection in the Ananas–Musa dot-plot (Fig. 3b). Coding regions had 30 indels of >10 bp (Table 3). Two 5-bp missense indels were found in rpl16 and rps19 (Table 3). The 5-bp insertion in rpl16 at positions 6–10 from the 3′ end of the coding sequence in Ananas created a termination signal (TAG) 8 bp earlier than in Musa (Fig. 4). The 5-bp deletion in rps19 of Ananas, corresponding to positions 2–6 in Musa, changed the initiation codon (ATG) to GTG in Ananas (Fig. 5).

Fig. 4 Missense insertion in Ananas compared with Musa in the 3′ end region of rpl16. Arrows indicate the rpl16 coding region. Insertion of 5 bp in Ananas caused an 8-bp shortening of the 3′ end of the coding sequence

Fig. 5 Missense deletion in Ananas compared with Musa in the 5′ end region of rps19. Arrows indicate the rps19 coding region. The initiation codon is GTG in Ananas but ATG in Musa

Figure 6 shows details of the IR–SC border positions with respect to adjacent genes in Ananas, Typha, and Musa. The lengths of the LSC, IR, and SSC were similar in Ananas and Typha. In contrast, whereas the lengths of the LSC were similar in Ananas and Musa, the IR of Ananas was 8659 bp shorter and the SSC of Ananas was 7854 bp longer than those of Musa (Fig. 6). In Ananas, the IRa/SSC border occurred in the 3′ region of ycf1 and created a ycf1 pseudogene of 1089 bp. Typha showed a similar structure. In Musa, the IRa/SSC border occurred in the 3′ region of ndhA and created an ndhA pseudogene. In Ananas, Typha, and Musa, the IRa/LSC border occurred downstream of the noncoding region of psbA, and the IRb/LSC border occurred upstream of the noncoding region of rpl22.

Fig. 6 Detailed view of the inverted repeat/single-copy (IR/SC) border regions among three chloroplast genomes. Annotated genes or portions of genes are indicated by gray boxes above or below the genome

Discussion

We determined the complete nucleotide sequence of the Ananas chloroplast genome using only next-generation sequencers instead of the Sanger sequencing method. Since no perfect assembler program has been created so far, de novo assembly always generates misassembled contigs, and thus assembled contigs must be checked by read-mapping and scanned for any regions of lower coverage (Naito et al. 2013). Most of the errors corrected by the HiSeq 2500 sequencing were in homopolymer stretches (data not shown), where errors are likely when the 454 GS FLX system is used (Gilles et al. 2011).

We identified 65 SSRs in the Ananas chloroplast genome. To date, chloroplast SSRs have been detected in Pinus radiata (Cato and Richardson 1996; Powell et al. 1995), Oryza sativa (Ishii et al. 2001), Panax ginseng (Kim and Lee 2004), Cucumis sativus (Kim et al. 2006), Vigna radiata (Tangphatsornruang et al. 2010), and Pyrus pyrifolia (Terakami et al. 2012). These SSRs can be useful in evolutionary studies because of their variability at the inter- and intrapopulation levels. We could not validate the SSRs identified here against phylogenetic data for Ananas. Future research will need to focus on the utility of these SSRs for phylogenetic and ecological studies of Ananas.

There has been a rapid increase in the number of studies using DNA sequences from completely sequenced chloroplast genomes for estimating phylogenetic relationships among angiosperms (Goremykin et al. 2005; Leebens-Mack et al. 2005; Bausher et al. 2006; Jansen et al. 2006, 2007; Ravi et al. 2006; Ruhlman et al. 2006). Our phylogenetic tree indicates a close relationship between Ananas and Typha with high bootstrap support (100 %). The phylogenetic tree identified Ananas as a basal member of the Poales, closer to Musa than to species of the Poaceae. These results are in good accordance with data revealed by phylogenetic methods based on the rbcL sequence (Bremer 2000).

The Ananas chloroplast genome structure is similar to that of Typha. Within the Poales, members of the Poaceae have a smaller chloroplast genome than Typha, with several alterations such as large inversions in the LSC and indels (Katayama and Ogihara 1996; Guisinger et al. 2010). The similarity of the LSC, SSC, and IR sizes of Ananas to those of Typha and the absence of an inversion in the LSC of Ananas strongly indicate that Ananas and Typha are closely related within the Poales and are phylogenetically distant from the Poaceae. On the other hand, the chloroplast genomes of Ananas and Musa show many structural differences. That of Ananas has an 8659-bp shorter IR and a 7854-bp longer SSC than that of Musa and is 10 kb smaller overall. Martin et al. (2013) suggested that the expansion of IRa to the SSC junction resulted in the incorporation of ycf1, rps15, ndhH, and ndhA in IRa of Musa. The alternative idea, that a deletion of IRb and a change of IRa to SSC occurred in Ananas and Typha, is not supported, because such extreme IR expansion has not been observed in other species and might have occurred independently only in the Musaceae (Martin et al. 2013).

Most indels of >200 bp between Ananas and Typha or Musa were located in intergenic spacer regions. Insertions in rpoB–trnC_GCA and ndhF–rpl32 and a deletion in trnE_UUC–trnT_GGU in Ananas appeared in both comparisons. Therefore, these large indels seem to be specific to Ananas. A large (291-bp) deletion relative to Musa occurred in the coding region of accD. The accD protein in most monocots is around 500 amino acids long, for example, 491 in Phoenix (GenBank ID: ADF28155.1) and Cocos (AGS43475.1), 482 in Oncidium (ACT83118.1), and 489 in Lilium (AGQ55767.1). The Ananas accD comprised 488 amino acids, whereas the Musa accD comprised 599 (CCW72384.1). Thus, the length of the Ananas accD is consistent with that in other monocots, but the Musa accD is much longer, suggesting a Musa-specific DNA insertion. Indels in coding regions between Ananas and Musa occurred especially frequently in ycf1 and ycf2 (Table 3). ycf1 and ycf2 show a wide range of length variation among species and are absent in the Poaceae (Asano et al. 2004; Chang et al. 2006; Hiratsuka et al. 1989; Leebens-Mack et al. 2005; Maier et al. 1995; Ogihara et al. 2000). These results suggest that alterations to ycf1 and ycf2 are nonfatal, and that indels occur comparatively easily in ycf1 and ycf2. Missense indels in rpl16 and rps19 were found in Ananas. The insertion in the 3′ end of the rpl16 coding sequence seems not to influence protein function, because the region is not conserved among species. The deletion in rps19 changed the initiation codon, ATG, to GTG in Ananas. GTG occurs in rps19 in various seed plant species (Raubeson et al. 2007), and among the monocot species used in the phylogenetic tree, only Musa has an ATG in rps19 (Asano et al. 2004; Chang et al. 2006; Hiratsuka et al. 1989; Leebens-Mack et al. 2005; Maier et al. 1995; Martin et al. 2013; Ogihara et al. 2000). Therefore, this alteration is unlikely to be critical to protein function either.

The complete chloroplast nucleotide sequence of Ananas and the structural and sequence differences between Ananas and other species that we present here will contribute to ecological and evolutionary studies.