Introduction

The chloroplast (cp) is the photosynthetic organelle that provides essential energy for plants and algae, and it is hypothesized to have arisen from ancient endosymbiotic cyanobacteria. In angiosperms, most cp genomes are circular DNA molecules containing a pair of inverted repeats (IRs), one large single-copy (LSC) region, and one small single-copy (SSC) region (Jansen et al. 2005; Wu et al. 2009). The size of cp genomes in higher plants ranges from 120 kb to more than 160 kb, owing to the evolutionary expansion and contraction of the IR regions (Wang et al. 2008). The number of completely sequenced cp genomes has increased significantly since the first cp genome was determined from Nicotiana tabacum (Shinozaki et al. 1986). The cp genome is useful in plant systematics because of its maternal inheritance and highly conserved structure, and recent plant phylogenetic studies have been conducted using partial or whole cp genomes (Asheesh and Vinay 2012).

Moraceae is cosmopolitan in distribution and contains many deciduous trees that have multiple uses (Nyree et al. 2005). Moraceae was once placed in the subclass Hamamelidae (Order: Urticales) (http://plants.usda.gov/), but it has since been repositioned in the order Rosales in Fabidae (also known as Rosid I) on the basis of some nuclear genes or cp genomes (Su et al. 2014; Zhang et al. 2011). Fabidae is considered an N2-fixing clade and contains the four orders Rosales, Cucurbitales, Fagales, and Fabales. Many phylogenetic analyses of this clade have indicated a single origin for the angiosperm predisposition to symbiotic nitrogen fixation (Soltis et al. 1995; Werner et al. 2015).

Mulberry (Family: Moraceae) is an economically important food crop for the domesticated silkworm. The mulberry genome has been sequenced, and a phylogeny based on single-copy genes from mulberry and 12 other sequenced plants indicated that Moraceae is most closely related to Rosaceae (He et al. 2013). Morus mongolica, commonly regarded as a wild type of mulberry, is native to China and North Korea, and it is an important breeding material because of its strong resistance to drought and cold, as well as its hard wood (Clement and Weiblen 2009). Here, we report the cp genome sequence of M. mongolica and present a comparative analysis between M. mongolica and M. indica, the only other complete cp genome available in Moraceae (Ravi et al. 2006). The genome structure, insertions and deletions (indels), repeat sequences, gene order, and phylogenetics were analyzed.

Materials and methods

Sample collection, genome sequencing, and assembly

The M. mongolica plant was identified in its natural habitat in the Tsinling Mountains (106°55′19″E, 34°14′29″N), on the basis of its morphological characteristics, by Professors Bairen Zhang (Ankang Institute of Agricultural Sciences) and Yunwu Peng (Ankang University). The plant was then transplanted and preserved in the mulberry field of the Key Sericultural Laboratory of Shaanxi, Ankang University.

Approximately 20 g of fresh leaves was sampled from a single M. mongolica plant, and the cpDNA was extracted using a modified high-salt method (Shi et al. 2012). After cpDNA isolation, approximately 5–10 μg of cpDNA was sheared, followed by adapter ligation and library amplification. The fragmented cpDNA was then prepared according to the Illumina Sample Preparation Instructions and sequenced as single reads on the Illumina HiSeq 2000 platform. The reads were assembled using SOAPdenovo (Luo et al. 2012) and reference-based approaches in parallel. Regions of the assembled cp genome with ambiguous alignments were manually trimmed and treated as gaps. Gaps were filled using the read overhangs at their margins and by PCR. The process was repeated, and the minimum coverage of the final assembled M. mongolica cpDNA reached approximately 1126-fold.

Genome annotation

The tRNA, rRNA, and protein-coding genes (PCGs) in the assembled genome were predicted and annotated using the Dual Organellar GenoMe Annotator (DOGMA) with default parameters (Wyman et al. 2004). The locations of the tRNAs were then confirmed using tRNAscan-SE version 1.21, specifying mito/cpDNA as the source (Lowe and Eddy 1997). The rRNA genes were verified using BLASTN searches against the database of published cp genomes. The positions of the start and stop codons and of the intron junctions of the PCGs were verified using BLASTN searches and the Sequin program with the plastid genetic code. The gene map was drawn using OGDraw v1.2 (Lohse et al. 2007).

Genome comparison and sequence analysis

To identify the differences between the M. mongolica and M. indica cp genomes, a pairwise alignment of the two sequences was performed using Clustal X 1.83. MISA was used to detect simple sequence repeats (SSRs) with minimum repeat numbers of 10, 5, 4, 3, 3, and 3 for mono-, di-, tri-, tetra-, penta-, and hexa-nucleotides, respectively (Thiel et al. 2003).
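The repeat-number thresholds above can be illustrated with a minimal Python sketch. This is not MISA itself: it is a simplified regular-expression scan for perfect repeats only, the function name `find_ssrs` is hypothetical, and compound or interrupted SSRs are ignored.

```python
import re

# Minimum repeat counts per motif length (mono- to hexa-nucleotide),
# taken from the thresholds stated in the text.
MIN_REPEATS = {1: 10, 2: 5, 3: 4, 4: 3, 5: 3, 6: 3}


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs.

    Simplified illustration: each motif size is scanned independently,
    and motifs that are themselves repeats of a shorter unit
    (e.g. "AA" or "ATAT") are skipped so a locus is reported once.
    """
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One captured motif followed by at least (min_rep - 1) more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            is_redundant = any(
                motif == motif[:k] * (size // k)
                for k in range(1, size)
                if size % k == 0
            )
            if not is_redundant:
                hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


if __name__ == "__main__":
    # Toy sequence (not real data): an A(12), an (AT)6 and an (ATTTC)3 locus.
    toy = "GC" + "A" * 12 + "GC" + "AT" * 6 + "GGT" + "ATTTC" * 3 + "G"
    for start, motif, count in find_ssrs(toy):
        print(start, motif, count)
```

Positions are 0-based. A production run would use MISA directly, since it also handles compound SSRs and reports the output formats used downstream.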

Phylogenetic analysis

To illustrate the phylogenetic relationship of the Fabidae clade with other major Rosid clades, 61 other complete cp genomes were downloaded from GenBank (Table S1). Nicotiana tabacum and Solanum bulbocastanum from Solanaceae were used as outgroups. Genes that appeared to be missing from a species were checked in each plastome using BLAST searches (e-value cutoff = 1e−10). Then, the 60 PCGs found in all of the species were extracted from the selected cp genomes. The amino acid sequences of each of the 60 cp PCGs were aligned using MSWAT (http://mswat.ccbb.utexas.edu/) with default settings and back-translated to nucleotide sequences. Phylogenetic analyses were performed on the concatenated nucleotide sequences by the maximum likelihood (ML) method using RAxML 7.2.6 (Stamatakis 2006). RAxML searches relied on the General Time Reversible model of nucleotide substitution with the gamma model of rate heterogeneity (GTRGAMMA). A bootstrap analysis was performed with 1000 replicates using non-parametric bootstrapping as implemented in the “fast bootstrap” algorithm. Finally, the absent genes were mapped onto the ML phylogenetic tree.
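The back-translation and concatenation steps can be sketched in a few lines of Python. This is an illustrative sketch rather than the MSWAT implementation: the function names `back_translate` and `concatenate` are hypothetical, and each coding sequence is assumed to be in frame with its stop codon already removed.

```python
def back_translate(aligned_protein, cds):
    """Thread an unaligned CDS onto its aligned protein sequence.

    Each residue is replaced by its original codon and each gap
    character '-' by '---', so the codon alignment mirrors the
    amino acid alignment.
    """
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    out = []
    idx = 0
    for residue in aligned_protein:
        if residue == "-":
            out.append("---")
        else:
            if idx >= len(codons):
                raise ValueError("CDS is shorter than the protein")
            out.append(codons[idx])
            idx += 1
    if idx != len(codons):
        raise ValueError("CDS is longer than the protein")
    return "".join(out)


def concatenate(gene_alignments):
    """Join per-gene alignments (taxon -> sequence) into one supermatrix.

    Assumes every gene alignment contains the same set of taxa, as is
    the case when only genes shared by all species are retained.
    """
    taxa = sorted(gene_alignments[0])
    return {t: "".join(gene[t] for gene in gene_alignments) for t in taxa}


if __name__ == "__main__":
    # Toy data (not real sequences).
    print(back_translate("M-K", "ATGAAA"))
    print(concatenate([{"a": "ATG", "b": "AT-"}, {"a": "CC", "b": "CG"}]))
```

Aligning at the amino acid level and then restoring the codons keeps gaps in multiples of three, which preserves the reading frame in the concatenated nucleotide matrix.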

Results

The overall structure and general features of the M. mongolica cp genome

The cp genome of M. mongolica is a closed circular molecule of 158,459 bp (GenBank accession number: KM491711), composed of a pair of IR regions (IRa and IRb) of 25,678 bp, one LSC region of 87,363 bp, and one SSC region of 19,736 bp. It has the typical quadripartite structure of the majority of land plant cp genomes. The GC contents of the LSC, SSC, and IR regions and of the whole cp genome are 34.0, 29.3, 42.9, and 36.3 %, respectively (Table 1), similar to those of other reported Rosid cp genomes (Su et al. 2014). The cp genome encodes 133 predicted functional genes, including 88 PCGs totaling 80,514 bp, and 37 tRNA and 8 rRNA genes totaling 11,865 bp. Among the 133 genes, 114 are unique and 19 are duplicated in the IR regions. Sixteen genes have one intron (10 PCGs and 6 tRNA genes), and 2 PCGs have two introns (clpP and ycf3). As in most other land plants, the maturase K gene (matK) is located within the intron of trnK, and rps12 is trans-spliced, with its two 3′ end exons separated by an intron in the IR region and its 5′ end exon in the LSC region (Fig. 1). The 8 rRNA genes comprise two identical copies of the 16S-23S-4.5S-5S rRNA gene cluster in the IR regions. Each cluster is interrupted by two tRNA genes, trnI and trnA, in the 16S-23S spacer region.
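Region-wise GC contents of the kind reported above can be computed with a short Python sketch. The function names and the region coordinates in the example are hypothetical placeholders, since the actual region boundaries are not listed in the text.

```python
def gc_percent(seq):
    """GC content as a percentage of unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return 100.0 * gc / acgt if acgt else 0.0


def region_gc(genome, regions):
    """GC content per named region, rounded to one decimal place.

    `regions` maps a region name to 0-based, half-open (start, end)
    coordinates. The coordinates are supplied by the caller.
    """
    return {
        name: round(gc_percent(genome[start:end]), 1)
        for name, (start, end) in regions.items()
    }


if __name__ == "__main__":
    # Toy genome and made-up coordinates, for illustration only.
    toy = "AAAAGGGGATGC"
    print(region_gc(toy, {"LSC": (0, 4), "IR": (4, 8), "SSC": (8, 12)}))
```

The AT content of a region is simply 100 minus its GC content when ambiguous bases are excluded, as they are here.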

Table 1 Summary of the M. mongolica and M. indica chloroplast genome features
Fig. 1 Gene map of the M. mongolica cp genome (KM491711). Genes lying outside the outer circle are transcribed in the counterclockwise direction, whereas genes inside are transcribed in the clockwise direction. The colored bars indicate the different functional groups. In the inner circle, the darker gray area denotes the GC content and the lighter gray area the AT content of the genome. LSC large single-copy, SSC small single-copy, IR inverted repeat

Comparison of M. mongolica and M. indica cp genomes

A comparative analysis of the M. mongolica and M. indica cp genomes revealed that sequence similarity in the rps4-trnT, trnT-trnL, and trnL-trnF intergenic regions and in the trnL intron is very low (it could not be detected using default parameters). This region was therefore verified by sequencing the PCR product of a specifically designed primer pair (Table S2). The results showed that the region was 2076 bp long in M. mongolica and 1920 bp long in M. indica, comprising 1.3 and 1.2 % of the respective genomes (Fig. 1). However, it was responsible for 88 % of the total variation between the two cp genomes. As with the hypervariable regions found between Asian and American Equisetum arvense (Kim and Kim 2014) and between Aegilops speltoides and Aegilops ovata (Guo and Toru 2005), the differences occurred mainly in intergenic spacers (IGS) or introns, and may provide essential information for cp phylogenomic studies.

Further, indel analyses of the M. mongolica cp genome relative to that of M. indica revealed 64 indels (217 bp), consisting of 24 insertions (39 bp) and 40 deletions (178 bp). The largest insertion and deletion were 7 and 72 bp, respectively (Table 2). Of the 64 indel events, 41 (64 %) were single-base indels, a percentage similar to those in maize and sugarcane (Yamane et al. 2006) and in the genus Citrus, in which single-base indels are the most frequent (Carbonell-Caballero et al. 2015). All of the indel events occurred in introns and IGS regions and had no influence on gene function, which suggests that non-coding regions are less constrained than functional genes.
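An indel tally of this kind can be derived from a gapped pairwise alignment, as in the following minimal Python sketch. The function name `tally_indels` is hypothetical. A gap run in the reference row is counted as an insertion in the query, a gap run in the query row as a deletion, and each contiguous gap run is assumed to be a single event.

```python
import re


def tally_indels(query_aln, ref_aln):
    """Count insertion and deletion events in the query relative to the reference.

    Both arguments are rows of the same pairwise alignment, with '-'
    marking gaps. Each contiguous run of gaps counts as one event.
    """
    if len(query_aln) != len(ref_aln):
        raise ValueError("aligned sequences must have equal length")
    insertions = [len(m.group()) for m in re.finditer(r"-+", ref_aln)]
    deletions = [len(m.group()) for m in re.finditer(r"-+", query_aln)]
    lengths = insertions + deletions
    single = sum(1 for n in lengths if n == 1)
    return {
        "insertions": len(insertions),
        "inserted_bp": sum(insertions),
        "deletions": len(deletions),
        "deleted_bp": sum(deletions),
        "single_base_fraction": single / len(lengths) if lengths else 0.0,
    }


if __name__ == "__main__":
    # Toy alignment (not real data): two 1-bp insertions, one 2-bp deletion.
    print(tally_indels("ACGT--ACGTA", "AC-TGGACG-A"))
```

With the counts reported in the text, the single-base fraction is 41 / 64 ≈ 0.64.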

Table 2 Indels in the M. mongolica chloroplast genome relative to that of M. indica

A total of 293 base substitutions (266 in the LSC region and 27 in the SSC region), corresponding to 0.18 % of the genome length, were found in the M. mongolica cp genome relative to M. indica, with a ratio of substitutions per unit length of 2.5:1:0 for the LSC:SSC:IR regions (Table 3). Of the 293 sites, 147 were transitions (Ts) and 146 were transversions (Tv), giving a Ts/Tv ratio of approximately 1.0, which is higher than the theoretical expectation of 0.5. The distribution of base substitutions and the Ts/Tv ratio were both uneven. There were 53 substitutions in the PCGs, with the highest Ts/Tv ratio (2.3); 29 in the introns, with an intermediate Ts/Tv ratio (1.6); and 207 in the IGS, accounting for 71 % of the total, with the lowest Ts/Tv ratio (0.81). In addition, there were four substitutions in tRNA genes. Of the 53 substitutions in the PCGs, 30 were concentrated in psaB (18 Ts and 4 Tv) and rps14 (6 Ts and 2 Tv), and the other 23 were distributed among 19 genes, with no more than 3 in any one gene (Table 3).
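The Ts/Tv classification can be made explicit with a short Python sketch. The function name `ts_tv` is hypothetical, and the input is assumed to be two rows of a pairwise alignment. A transition exchanges two purines (A, G) or two pyrimidines (C, T), and any other substitution is a transversion. Because 4 of the 12 possible base changes are transitions and 8 are transversions, the ratio expected when all changes are equally likely is 4/8 = 0.5.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def ts_tv(seq_a, seq_b):
    """Count transitions and transversions between two aligned sequences.

    Columns containing a gap or an ambiguous base are skipped.
    """
    ts = tv = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b or a not in "ACGT" or b not in "ACGT":
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            ts += 1  # purine<->purine or pyrimidine<->pyrimidine
        else:
            tv += 1  # purine<->pyrimidine
    return ts, tv


if __name__ == "__main__":
    # Toy alignment (not real data).
    print(ts_tv("AACCGT-A", "GTCTGAAA"))
    # Ratios recomputed from the counts reported in the text.
    for label, (n_ts, n_tv) in {"all": (147, 146), "psaB": (18, 4), "rps14": (6, 2)}.items():
        print(label, round(n_ts / n_tv, 1))
```

The recomputed ratios (1.0 overall, 4.5 for psaB, and 3.0 for rps14) follow directly from the counts given in the text.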

Table 3 Base substitutions between the M. mongolica and M. indica chloroplast genomes

Analyses of simple sequence repeats (SSRs)

A total of 78 SSR loci, with a combined length of 742 bp, were detected in the M. mongolica cp genome, comprising 58 mono-, 6 di-, 2 tri-, 10 tetra-, and 2 penta-nucleotide repeats. Most of the SSRs are mononucleotide repeats, which is consistent with the study of George et al. (2015). All of the mononucleotide repeats and 14 of the other SSRs were composed of A and T nucleotides, giving these sequences a higher AT content (99.7 %) than the genome as a whole. Among the SSRs, 43 were located in IGS regions and 10 were found in coding genes, including atpF, cemA, ndhF, rpoC2, atpB, rpoB, and ycf1. Compared with M. indica, 52 loci were identical, 22 exhibited length polymorphisms, 3 were not found, and 1 locus exhibited a nucleotide content polymorphism, (ATTTC)3 in M. mongolica versus (TTTCT)3 in M. indica (Table 4).

Table 4 SSRs in M. mongolica and comparison with M. indica

Phylogenetic analysis

In this study, the concatenated nucleotide sequences of 60 PCGs from 62 cp genomes of the Rosid clade were used to reconstruct phylogenetic relationships by the ML method. In total, 45,327 nucleotide positions were analyzed, and the best-scoring ML tree (final ML optimization likelihood = −416381.612810) with bootstrap support values is depicted in Fig. 2. Bootstrap values were greater than 90 % at most nodes (55 of 61), and 51 nodes had 100 % support. Moraceae and Rosaceae were placed in Rosales, the sister group of Fagales and Cucurbitales, which is consistent with earlier studies (Su et al. 2014; He et al. 2013). The Fabidae clade, which is monophyletic, is composed of the orders Rosales, Cucurbitales, Fagales, and Fabales, and had a single origin for symbiotic nitrogen fixation.

Fig. 2 Phylogenetic analysis of the Rosid clade using shared PCGs by the ML method. Two Solanaceae species are included as the outgroup to root the tree. Bootstrap support values are indicated near the nodes. The putative gene loss events are indicated by colored rectangles

Gene content analysis of the selected species showed that six genes had been lost to different degrees. According to Millen et al. (2001), the translation initiation factor 1 gene, infA, was lost from the cp during angiosperm evolution. In this study, infA was lost in most of the Rosids and was present only in Castanopsis echinocarpa and Quercus rubra from Fagales, Cucumis melo subsp. melo and Cucumis sativus from Cucurbitales, and Hypseocharis bilobata from Geraniales. The rps16 gene was lost in 15 species, including all seven from the families Chrysobalanaceae and Salicaceae, and one, three, three, and one from Euphorbiaceae, Fabaceae, Brassicaceae, and Geraniaceae, respectively. The rpl22 gene was lost in Fabaceae, in Theobroma from Malvaceae, and in Castanea mollissima and Trigonobalanus doichangensis from Fagaceae. In addition, the rpl32 gene was lost in the three species of Salicaceae, accD was lost in Hypseocharis bilobata from Geraniaceae, and clpP was lost in Viviania marifolia from Vivianiaceae (Fig. 2).

Discussion

An earlier study showed that M. mongolica is a wild type of mulberry and is closely related to M. indica (Yang et al. 2003). In this study, we determined the complete cp genome sequence of M. mongolica and compared it with the M. indica cp genome. The M. mongolica cp genome is 158,459 bp long, 25 bp shorter than that of M. indica, and encodes the same functional genes in the same order. Further analyses revealed one hypervariable region, 64 indels, and 293 base substitutions between the two cp genomes, together accounting for 1.5 % of the total length. The differences between the two cp genomes were distributed mostly in the LSC region, with few in the SSC region and none in the IR regions, which is consistent with the conclusion of earlier studies that the IR regions contribute greatly to the size and conservation of the cp genome (Maier et al. 1995; Guisinger et al. 2010).

The occurrence of base substitutions, also known as SNPs, could be affected by the DNA mismatch repair enzyme, DNA synthetase, and selection pressure (Seo et al. 2000). The Ts/Tv ratio should be 0.5 in theory, but in practice the value is often higher because of the genetic characteristics of codons, the corresponding pattern of codon replacements, and the genome content (Yang and Yoder 1999; Morton 2003). The ratios in the whole cp genome and the IGS regions were 1.0 and 0.8, respectively, whereas the value reached 2.3 in the coding regions, showing a stronger Ts/Tv bias. Further analysis suggests that the AT content differs between the coding and IGS regions (66.9 and 61.0 %, respectively), so nucleotide frequency may be one possible cause of the Ts/Tv bias (Morton 1995). The cause of the bias may also vary among genes; for example, rps14 (Ts/Tv = 3) has three Ks and five Kn changes, whereas psaB (Ts/Tv = 4.5) has 20 Ks and two Kn changes. Thus, natural selection may be a possible cause in some genes (Escalante et al. 1998).

Most cp genomes are quite AT rich (above 60 %), with unevenly distributed AT contents and lower AT contents in conserved regions (Cai et al. 2006). The M. mongolica cp genome shares these features: the AT contents of the whole cp genome and the LSC, SSC, and IR regions are 63.7, 67.0, 70.7, and 57.1 %, respectively, and no changes were found in the IR regions between the two mulberry species. Similarly, the regions with a high AT content harbor more variation, such as the hypervariable region (71 % AT) and the SSRs (99.7 % AT). The SSR polymorphisms between M. mongolica and M. indica all involved A or T mutations. Meanwhile, ycf1, a gene with a high AT content in many plants (71.4 % in M. mongolica), is one of the most rapidly evolving cp genes (Barnard-Kubow et al. 2014). For the hypervariable region, the AT contents of the two sequences (71 and 78 % in M. mongolica and M. indica, respectively) were higher than that of the whole cp genome (63.7 %), which reveals a relationship between AT content and indels, as previously described (Kelchner 2000; Yamane et al. 2006). These observations suggest a positive correlation between AT content and sequence divergence, and a bias toward A and T changes over G and C changes in plant cp genomes.

The cp genome is an effective marker for plant phylogenetic studies, and the results of our analysis based on 62 cp genomes were congruent with earlier studies that deemed Fabidae monophyletic (Zhang et al. 2011; He et al. 2013; Su et al. 2014). Zhang et al. (2011) used 2 nuclear genes and 10 cp loci to elaborate the phylogenetic relationships within Rosales, and their results indicated that Rosales is monophyletic and that Moraceae is sister to Rosaceae. The study of Su et al. (2014) indicated that the sequences from the four orders of Fabidae form a clade. Fabidae is considered an N2-fixing clade in many phylogenetic analyses, and the predisposition for symbiotic nitrogen fixation has a single origin in angiosperms (Soltis et al. 1995; Werner et al. 2015).

In higher plants, gene loss is an ongoing process. In this study, infA, rpl22, rps16, and other genes were found to have been lost to different degrees. The absence of infA in most species is consistent with reports that the infA gene has been independently transferred to, and is expressed in, the nucleus (Landau et al. 2007; Millen et al. 2001). The rpl22 gene was also lost in Fabaceae and some species of Fagaceae. In earlier studies of the legume pea, the rpl22 gene was found to be missing from the cp genome, but a functional copy of rpl22 existed in the nucleus. The phylogenetic study indicated that the loss of rpl22 from the cp occurred after it was independently transferred to the nucleus in an ancestor of all flowering plants (Gantt et al. 1991). Studies on the rps16 gene in Medicago truncatula and Populus alba indicate that the loss of rps16 from their cp genomes was compensated for by the nuclear-encoded rps16, which can target the cp as well as the mitochondria, and that this dual targeting of rps16 may have emerged before the divergence of monocots and dicots (Ueda et al. 2008). Such transfer and substitution events may be important processes in the evolution of the eukaryotic cell.