Introduction

Elaeagnus mollis is known as ‘monk Tang’s flesh’ because its seed contains up to 1558.1 mg/100 g of vitamin E (Yao 2005), a level that is rare in nature. E. mollis, a small deciduous tree belonging to the genus Elaeagnus of the family Elaeagnaceae, is a relic of the Quaternary glaciations in China. It is regarded as a rare woody oil plant with high economic, medicinal, edible and ecological value (Xie and Ling 1997; Liang et al. 2015). Previous studies showed that E. mollis kernels contain 32.21% protein, with 17 different amino acids. Seven of these amino acids (including valine, methionine, leucine, isoleucine, phenylalanine and lysine) are essential for humans and animals (Yao 2005). Copious high-quality oil, especially linoleic acid, is found in E. mollis seeds and can be used to treat arteriosclerosis (Zhang and Zhang 2015). However, this valuable resource is distributed only in Shaanxi and Shanxi Provinces of China. Wild populations of E. mollis have declined sharply, so protection is urgently required, and effective protection calls for a better understanding of the species at the molecular level.

With the rapid development of sequencing technologies and their decreasing costs, studies of plastid genomes have multiplied. Approximately 4354 plant plastid genomes are available in the National Center for Biotechnology Information (NCBI) nucleotide database (NCBI 2020, https://www.ncbi.nlm.nih.gov/genomes/GenomesGroup.cgi?taxid=2759&opt=plastid), including those of food crops, economic crops and medicinal plants. In general, plant plastid genomes are characterized by maternal inheritance, a slow rate of evolution, a moderate rate of nucleotide substitution and high conservation. Therefore, plastid genomes are widely used in phylogenetic analysis, DNA barcoding and molecular marker development (Jansen et al. 2007; Parks et al. 2009; Barrett et al. 2013). In medicinal plants, plastid genome sequencing makes it possible to obtain more abundant DNA molecular information, to identify species used in traditional Chinese medicine, and to assess the genetic diversity of genuine medicinal materials (Lin et al. 2010).

In this study, high-throughput sequencing was used to analyse the plastid genome of E. mollis. We also compared it with the published plastid genomes of close relatives in the family Elaeagnaceae (Choi et al. 2015; Chen and Zhang 2017) to assess how well plastid genomes discriminate closely related species and to provide data for selecting more suitable DNA barcodes. Recent studies have shown that plastid genome sequences are indispensable data for plant phylogenetics (Parks et al. 2009). Here, a phylogenetic tree was constructed from 32 plastid genomes of Rosales to provide a better understanding of the evolution of E. mollis.

Materials and methods

Sample collection and DNA extraction

Samples of E. mollis were collected from Xiangning County (N36°02′35.25″, E111°07′57.36″) in Shanxi Province, China, and promptly dried with silica gel. Total genomic DNA was extracted by the cetyl trimethyl ammonium bromide (CTAB) method with slight modification, using the Plant Genomic DNA Extraction kit (Tiangen, Beijing, China) according to the manufacturer's instructions.

Plastid genomes sequencing, assembly and annotation

The concentration and quality of the extracted genomic DNA were determined with QuantiFluor (Promega, USA), and the DNA was then subjected to high-throughput amplicon sequencing on the Illumina HiSeq 2500 platform (Biomarker Biotechnology, Beijing, China). In addition, we downloaded the only two published Elaeagnaceae plastid genome sequences (Elaeagnus macrophylla NC_028066 and Hippophae rhamnoides NC_035548) for comparative analysis. Low-quality reads were removed from the raw reads using NGSQC Toolkit v. 2.3.3 with default parameters (Patel and Jain 2012). The filtered paired-end reads were then used to reconstruct the plastid genome with MIRA v. 4.0.2 (Chevreux et al. 2004) and MITObim v. 1.7 (Hahn et al. 2013). To ensure accurate assembly, E. macrophylla (NC_028066) and H. rhamnoides (NC_035548) were used as references. The assembled plastid genome sequence was annotated with DOGMA (Wyman et al. 2004) and manually corrected by comparison with published Elaeagnaceae plastid genomes in Geneious R8 (Biomatters, Auckland, New Zealand). The plastid genome map was drawn with OGDRAW (http://ogdraw.mpimp-golm.mpg.de/) (Lohse et al. 2013).

Plastid genome comparison and repeat sequence analysis

A visual alignment of the E. mollis, E. macrophylla and H. rhamnoides plastid genomes was generated in mVISTA (Shuffle-LAGAN mode) (Frazer et al. 2004). The online software REPuter (Kurtz et al. 2001) was used to identify dispersed and palindromic repeat sequences with the following parameters: (i) Hamming distance of 3, (ii) maximum repeat size of 50 bp, and (iii) minimum repeat size of 30 bp. Tandem repeat sequences longer than 10 bp were detected using the online program Tandem Repeats Finder (Benson 1999), with settings of 50 and 500 for the minimum alignment score and maximum period size, respectively. The alignment parameters for match, mismatch and indel were set to 2, 7 and 7, respectively.
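To make the repeat settings above concrete, the following Python sketch shows what a Hamming distance of 3 and a minimum repeat size of 30 bp mean for a single candidate pair of sequence windows. It is an illustration only and is not how REPuter searches a genome. The function names and the toy sequence are hypothetical, and treating a "forward" match as a dispersed repeat is an assumption made here for simplicity.

```python
# Illustrative only: applies the Hamming-distance and minimum-length settings
# to ONE candidate pair of windows. This is not REPuter's search algorithm.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def hamming(a: str, b: str) -> int:
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_pair(a: str, b: str, max_dist: int = 3, min_len: int = 30):
    """Return 'forward', 'palindromic', or None for a pair of windows."""
    if len(a) != len(b) or len(a) < min_len:
        return None
    if hamming(a, b) <= max_dist:
        return "forward"  # same orientation (assumed to match "dispersed")
    if hamming(a, reverse_complement(b)) <= max_dist:
        return "palindromic"  # matches the reverse complement
    return None


# Hypothetical 30 bp window and two variants of it.
unit = "ATGGCGTACGTTAGCCTAGGATCCAGTTCA"
near_copy = "C" + unit[1:29] + "G"  # two mismatches relative to unit

print(classify_pair(unit, near_copy))
print(classify_pair(unit, reverse_complement(unit)))
```

With these toy inputs the first pair is accepted as a forward repeat because it differs at two positions, and the second is accepted as palindromic because one window is the exact reverse complement of the other.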

Codon usage bias

In total, 81 protein-coding genes (PCGs) longer than 100 bp were selected for a synonymous codon usage analysis to ensure sampling accuracy. Relative synonymous codon usage (RSCU), the ratio of the observed frequency of a codon to its expected frequency (Sharp and Li 1986), was determined using MEGA 5.0 (Tamura et al. 2011). An RSCU value less than 1.0 was considered evidence of a lack of bias, a value between 1.0 and 1.2 low bias, between 1.2 and 1.3 moderate bias, and greater than 1.3 high bias (Zuo et al. 2017).
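The RSCU definition above can be sketched in a few lines of Python. This is a minimal illustration and not the MEGA implementation: the expected frequency of a codon is taken as the total count for its amino acid divided by the number of synonymous codons. The function names and the toy coding sequence are hypothetical, and the handling of values falling exactly on a threshold is an assumption.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered by T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds_list):
    """Return {codon: RSCU} for every amino acid observed in the sequences."""
    counts = Counter()
    for cds in cds_list:
        seq = cds.upper().replace("U", "T")
        for i in range(0, len(seq) - len(seq) % 3, 3):
            codon = seq[i:i + 3]
            if codon in CODON_TABLE:  # skips codons with ambiguous bases
                counts[codon] += 1
    values = {}
    for aa in set(AMINO) - {"*"}:
        family = [c for c, a in CODON_TABLE.items() if a == aa]
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid not observed
        expected = total / len(family)  # equal use of all synonymous codons
        for codon in family:
            values[codon] = counts[codon] / expected
    return values


def bias_class(value):
    """Classify an RSCU value with the thresholds quoted in the text."""
    if value < 1.0:
        return "none"
    if value < 1.2:
        return "low"
    if value <= 1.3:  # boundary handling is an assumption
        return "moderate"
    return "high"


# Hypothetical toy CDS: ATT x3, ATC x1, ATA x2 (all isoleucine).
toy = rscu(["ATTATTATTATCATAATA"])
print({codon: (round(v, 2), bias_class(v)) for codon, v in sorted(toy.items())})
```

In the toy input, six isoleucine codons are spread over three synonymous codons, so the expected count per codon is 2 and ATT, used three times, has an RSCU of 1.5.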

Phylogenetic analysis

In total, 32 plastid genomes from eight families were downloaded from the NCBI GenBank nucleotide database (https://www.ncbi.nlm.nih.gov/). The reliability of a phylogenetic analysis depends mainly on the accuracy of the sequence alignment (Morrison and Ellis 1997; Ogden and Rosenberg 2006; Hohl and Ragan 2007). Multiple sequence alignment was conducted using the MAFFT program (Katoh and Standley 2013). The best-fitting nucleotide substitution model for the phylogenetic analysis was then selected using the ModelGenerator program (Keane et al. 2006). The phylogenetic tree was constructed as previously described (Huang et al. 2020). Maximum likelihood (ML) analyses were performed using RAxML v. 8.1.5 (Stamatakis 2014).

Results

Basic information of the plastid genomes

The plastid genome of E. mollis obtained in this study has been deposited in GenBank under the accession number MG386504. It is 151,354 bp long and contains one LSC region of 81,072 bp and one SSC region of 18,560 bp, separated by a pair of IRs of 25,861 bp each (figure 1). The guanine and cytosine (GC) contents of the LSC, SSC and IR regions were 35%, 30.1% and 42.6%, respectively. In total, 132 genes were predicted in the plastid genome of E. mollis, including eight rRNA genes, 38 tRNA genes and 86 protein-coding genes. Of these, 113 were unique genes; four rRNA genes (rrn4.5, rrn5, rrn16 and rrn23), eight tRNA genes (trnA-UGC, trnH-GUG, trnI-CAU, trnI-GAU, trnL-CAA, trnN-GUU, trnR-ACG and trnV-GAC) and seven protein-coding genes (ndhB, rps7, rps12, rpl2, rpl23, ycf1 and ycf2) were duplicated in the two IRs (table 1). Among the 113 unique genes, ycf1 spanned the SSC and IR regions. Additionally, 15 genes (trnA-UGC, trnG-GCC, trnI-GAU, trnK-UUU, trnL-UAA, trnV-UAC, rpl2, rpl16, rps12, rpoC1, atpF, petB, petD, ndhA and ndhB) contained one intron, and two genes (ycf3 and clpP) contained two introns. The longest intron, at 2825 bp, was in trnK-UUU (table 2).
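The figures reported above can be checked for internal consistency with simple arithmetic. The Python sketch below uses only numbers from this paragraph. It assumes that the total of 132 genes counts each IR-duplicated gene twice, which leaves 113 distinct genes once the 19 duplicates are removed. The variable names are illustrative.

```python
# Region lengths (bp) reported for the E. mollis plastid genome.
lsc, ssc, ir = 81_072, 18_560, 25_861

# Quadripartite structure: LSC + SSC + two copies of the IR.
genome_length = lsc + ssc + 2 * ir

# Predicted genes by class, counting both copies of IR-duplicated genes.
gene_counts = {"rRNA": 8, "tRNA": 38, "protein_coding": 86}
total_genes = sum(gene_counts.values())

# Genes reported as present in both IRs (second copies).
duplicated_in_ir = {"rRNA": 4, "tRNA": 8, "protein_coding": 7}
unique_genes = total_genes - sum(duplicated_in_ir.values())

print(genome_length, total_genes, unique_genes)  # 151354 132 113
```

The region lengths sum to the reported genome size of 151,354 bp, and the gene classes sum to the reported 132 genes.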

Figure 1

Genetic map of the E. mollis chloroplast genome. Genes belonging to different functional groups are colour-coded. Genes outside the circle are transcribed clockwise, whereas genes inside the circle are transcribed counterclockwise. The inner circle indicates the inverted repeat boundaries and GC content.

Table 1 Genes found in the E. mollis plastid genome.
Table 2 Genes with introns in the E. mollis plastid genome and length of exons and introns.

Comparison with plastids in other Elaeagnaceae species

We compared the basic features of the E. mollis plastid genome obtained in the present study with those of two previously published Elaeagnaceae plastid genomes (E. macrophylla and H. rhamnoides). Each of the three plastid genomes contained 38 tRNA genes and eight rRNA genes. The length of the three genomes ranged from 151,354 bp (E. mollis) to 156,123 bp (H. rhamnoides). The length of the LSC region varied from 81,072 bp (E. mollis) to 83,331 bp (H. rhamnoides), that of the SSC region from 18,560 bp (E. mollis) to 18,831 bp (H. rhamnoides), and that of each IR from 25,861 bp (E. mollis) to 26,658 bp (H. rhamnoides) (table 3). The GC contents of the plastid genomes of E. mollis, E. macrophylla and H. rhamnoides were 37.0%, 41.1% and 36.7%, respectively (table 3).
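The whole-genome GC content of E. mollis can be related to the regional values as a length-weighted average. The Python sketch below uses the region lengths and GC percentages reported for E. mollis in this paper. The variable names are illustrative, and the regional percentages are rounded values, so the result is approximate.

```python
# Region lengths (bp) and GC fractions reported for E. mollis.
regions = {
    "LSC": (81_072, 0.350),
    "SSC": (18_560, 0.301),
    "IRs": (2 * 25_861, 0.426),  # both IR copies together
}

total_length = sum(length for length, _ in regions.values())
gc_bases = sum(length * gc for length, gc in regions.values())
overall_gc = gc_bases / total_length

print(f"{overall_gc:.1%}")  # 37.0%
```

The weighted average reproduces the 37.0% reported for the whole genome. The IRs contribute about a third of the sequence at a markedly higher GC content than the single-copy regions.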

Table 3 The features of plastid genomes of three Elaeagnaceae species.

Forty-seven repeat sequences were identified in the plastid genome of E. mollis, including tandem (17), dispersed (12) and palindromic (18) types. The distribution of repetitive sequences was similar in the plastid genomes of E. mollis, E. macrophylla and H. rhamnoides, with dispersed repeats being the least abundant type (figure 2).

Figure 2

Numbers of repeats in the plastid genome of E. mollis compared with those of the other two species.

Contraction and expansion at the boundaries of the IR regions are common evolutionary events that largely account for the variation in the size of angiosperm plastid genomes. These events play a crucial role in evolution (Kode et al. 2005; Raubeson et al. 2007; Yao et al. 2015). The structure and size of the three plastid genomes were highly conserved, although the IR/SSC boundary regions varied slightly (figure 3). The border between the IRB and SSC encompassed the ycf1 gene, with ycf1 pseudogenes found in the plastid genomes of all three species. The length of the ycf1 pseudogene was similar in the three Elaeagnaceae species. Overlap between ndhF and ycf1 was noted in all three species (E. mollis, E. macrophylla and H. rhamnoides), with ndhF expanding into the IRB region by 10 bp, 11 bp and 7 bp, respectively.

Figure 3

Comparison of the four junctions (LSC/IRB, IRB/SSC, SSC/IRA and IRA/LSC) among the three Elaeagnaceae plastid genomes.

The SSC–IRA junctions were situated in the ycf1 coding region, and the length of ycf1 in the IRA region varied among the three species from 1215 bp to 1247 bp. The IRA–LSC junctions were located between trnH and psbA; at this junction, the trnH gene extended 83 bp, 118 bp and 214 bp into the IRA region in E. mollis, E. macrophylla and H. rhamnoides, respectively. In summary, genes other than rps19 showed similar arrangements at the IR–SSC and IR–LSC boundaries in these three species (figure 3).

The psbA gene was located in the LSC region in all three species. The rps19 genes of E. macrophylla and E. mollis were both located in the LSC region, whereas in H. rhamnoides the IR region extended to include a short rps19 pseudogene of 74 bp. The trnH-GUG genes were all located in the IR region; their distances from the LSC–IRB boundary were 85 bp, 118 bp and 74 bp, and from the LSC–IRA boundary 83 bp, 118 bp and 214 bp, in E. mollis, E. macrophylla and H. rhamnoides, respectively.

Analysis of plastid genomes is crucial for understanding the relatedness of plant species and becomes increasingly valuable as more DNA sequences are published. Numerous comparative analyses of plastid genomes have examined interspecific relationships and identified specific DNA barcodes in closely related plant species (Chen et al. 2012). The plastid genome sequence of E. mollis was compared with those of E. macrophylla and H. rhamnoides using the mVISTA program. The results showed a few substantial differences among the plastid genomes of the three Elaeagnaceae species (figure 4). These differences, which could be used as specific DNA barcodes, occurred in the intergenic regions atpH-atpI, petN-psbM, trnT-psbD, trnP-psaJ and rpl32-trnL, and in ycf1. The tRNA and rRNA coding regions (light blue) were the most highly conserved, and figure 4 shows that conserved noncoding sequences (CNS) were more divergent than coding regions (exons). We also found trnH-GUG duplication in the IR regions of the plastid genomes of the three species.

Figure 4

mVISTA per cent identity plot comparing the three chloroplast genomes, with E. mollis as the reference. The y-axis represents per cent identity in the range 50–100%. Grey arrows and thick black lines above the alignment indicate genes with their orientation and the position of the IRs. Purple regions represent exons, light-blue regions represent rRNA or tRNA coding genes, pink regions represent CNS, grey regions represent mRNA, and white peaks represent genomic differences.

Codon usage bias

The codon usage frequency in the plastid genome of E. mollis is shown in table 4. A high degree of conservation is apparent. Leucine (Leu, ~10.52%) and cysteine (Cys, ~1.4%) were the most and least used amino acids, respectively. Among the 20 amino acids, methionine (Met) and tryptophan (Trp) were each encoded by only one codon, whereas the others were encoded by multiple codons. Isoleucine (Ile) was encoded by three codons, AUA, AUC and AUU, with RSCU values of 0.97, 0.59 and 1.44, respectively. Therefore, AUU was the preferred codon for Ile. Bias in synonymous codon usage was observed for all amino acids other than Met and Trp, as in the plastid genomes of other higher plants. The results indicated that the third nucleotide of the preferred codons was mostly A or U.
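The isoleucine values above illustrate a useful property of RSCU: within one amino acid, the values sum to the number of synonymous codons, so each value divided by that number gives the codon's share of use. The short Python check below uses only the three RSCU values reported here. The variable names are illustrative, and the shares are approximate because the published values are rounded.

```python
# RSCU values reported for the three isoleucine codons in E. mollis.
ile_rscu = {"AUA": 0.97, "AUC": 0.59, "AUU": 1.44}

n_codons = len(ile_rscu)
rscu_sum = sum(ile_rscu.values())  # should be close to 3 for a 3-codon family

# Share of isoleucine positions using each codon (RSCU / number of codons).
share = {codon: value / n_codons for codon, value in ile_rscu.items()}
preferred = max(ile_rscu, key=ile_rscu.get)

for codon, fraction in share.items():
    print(f"{codon}: {fraction:.1%}")
print("preferred codon:", preferred)
```

Under this reading, AUU accounts for roughly 48% of isoleucine codons, compared with about 32% for AUA and 20% for AUC.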

Table 4 Synonymous codon usage in the E. mollis plastid genome.

Phylogenetic analysis of Rosales

Plastid genome sequences have been used extensively to analyse plant phylogenies (Goremykin et al. 2015; Sun et al. 2016). Phylogenetic trees were reconstructed using ML methods for 32 species from the order Rosales. Castanea mollissima was set as the outgroup to infer the phylogenetic position of Elaeagnus within Elaeagnaceae and the relationships within Rosales. Whole plastid genomes were used to construct the phylogenetic trees.

Apart from the small family Barbeyaceae, and Dirachmaceae, for which no sequence has been published, most Rosales families were represented by three to five samples. As shown in figure 5, Rosales families were divided into two strongly supported clades. Rosaceae, in the first clade, were sister to the rest of the order. The remaining Rosales were divided into two subclades: (i) Ulmaceae, Cannabaceae, Boehmeria (Urticaceae) and Moraceae (MLBS=100%); (ii) Barbeyaceae, Elaeagnaceae and Rhamnaceae (MLBS=100%). In the first of these subclades, Boehmeria and Moraceae formed a well-supported monophyletic group (MLBS=100%); Boehmeria + Moraceae, in turn, formed a clade with Cannabaceae (MLBS=100%); finally, Ulmaceae were sister to Cannabaceae, Moraceae and Urticaceae (MLBS=100%). In the second of these subclades, Rhamnaceae were sister to the rest of the clade, which formed a well-supported monophyletic group (MLBS=100%). Barbeyaceae plus Elaeagnaceae formed a clade with Rhamnaceae (MLBS=87%).

Figure 5

Phylogenetic relationships based on complete chloroplast genome sequences of species belonging to Rosales inferred from ML analysis.

During the evolution of plants, many genes have been lost from plastid genomes. By far, infA is the most mobile plastid gene in plants. A study by Millen et al. (2001) suggests that many infA copies in plastid DNA were lost during angiosperm evolution. Our results showed that the infA gene has been lost from Rosaceae, Ulmaceae, Moraceae, Barbeyaceae, Cannabaceae, Boehmeria and some species of Rhamnaceae (four genera). Further, comparative analysis of Rosales plastid genomes indicated that duplication of trnH occurs only in Elaeagnaceae.

Discussion

The plastid genome of E. mollis obtained here was similar to the previously published plastid genome of E. mollis (NC_036932) in size, gene content, GC content and typical quadripartite structure (Wang et al. 2017a). When it was compared with those of E. macrophylla and H. rhamnoides, all three plastid genomes possessed the typical quadripartite structure of circular, double-stranded DNA. The size, structure, gene content and GC content of the newly generated plastid genome were generally similar to those published for Elaeagnus, indicating that plastid genomes are conserved in Elaeagnus. Additionally, six divergence hotspots (atpH-atpI, petN-psbM, trnT-psbD, trnP-psaJ, rpl32-trnL and ycf1) were identified by comparing the plastid genomes of the three Elaeagnaceae species; these could be used as molecular genetic markers for population genetic and phylogenetic studies. Although the infA gene is absent in most Rosales plants (Millen et al. 2001; Su et al. 2014; Choi et al. 2015), it was intact in Elaeagnus. Compared with other plants of the same genus, the IR region of the plastid genome also changed slightly, which may be caused by contraction and expansion of the IR region (Huang et al. 2014). Codon usage bias means that synonymous codons are used at different frequencies (Ermolaeva 2001). The results showed a strong bias toward A/T at the third codon position, in line with previous findings in other land plants (Wang et al. 2017b; Zhou et al. 2017).

Duplication of the trnH gene occurs only in Elaeagnaceae and could therefore be a useful marker in Rosales. Elaeagnaceae include three genera: Elaeagnus, Hippophae and Shepherdia. The family has been placed in different orders in different classifications, for example close to Proteaceae, Rhamnaceae, Thymelaeaceae or Penaeaceae (Jansen et al. 2000). The phylogenetic analysis of Rosales revealed that Elaeagnaceae are sister to Barbeyaceae. Elaeagnaceae fell in a clade with Barbeyaceae and Rhamnaceae, and Rosaceae were sister to all other Rosales, consistent with previous studies (Sytsma et al. 2002; Wang et al. 2009; Zhang et al. 2011). The remainder of the order comprises two subclades: (i) Ulmaceae are sister to Cannabaceae plus (Boehmeria and Moraceae); (ii) Rhamnaceae are sister to Elaeagnaceae plus Barbeyaceae. The plastid genome sequences fully resolved phylogenetic relationships within Rosales with strong internal support. This is the first analysis of complete plastid genomes of all Rosales lineages, and our results are generally similar to previous observations based on two nuclear and 10 plastid loci (Zhang et al. 2011).

Conclusions

In this study, we successfully sequenced the plastid genome of E. mollis and compared it with plastid genomes from other species of Elaeagnaceae to determine sequence variation and molecular phylogeny. The plastid genome of E. mollis showed a typical quadripartite structure, similar to those of other angiosperm species. We identified divergence hotspots in E. mollis, a species of medicinal and economic value, that could be used as potential genetic markers in further studies.

We reconstructed the phylogenetic relationships of Rosales from complete plastid genome data, aiming to clarify the evolutionary relationships among the major clades of the order. We resolved these relationships and also found that duplication of trnH occurs only in Elaeagnaceae, suggesting that trnH may be an important marker for phylogenetic studies of Rosales. Our study not only contributes to a basic understanding of the E. mollis plastid genome and provides a valuable resource for evolutionary research in the family Elaeagnaceae, but also demonstrates the effectiveness of plastid phylogenomics for further resolving phylogenetic relationships in Rosales.