Introduction

The family Sapindaceae comprises woody and herbaceous shrubs, lianas, and tropical trees, with approximately 141 genera and 1900 species (Soltis et al. 2000; Li 2003; Harrington et al. 2005). The family consists of four major subfamilies, i.e., Sapindoideae, Dodonaeoideae, Xanthoceroideae, and Hippocastanoideae (which includes the tribes Acereae and Hippocastaneae) (Harrington et al. 2005; Buerki et al. 2009). Dodonaea (subfamily Dodonaeoideae) is the largest genus, consisting of 70 species with significant economic and ecological benefits (Wagner et al. 1999; Guerin et al. 2012). In particular, Dodonaea viscosa is a dense woody shrub with seven subspecies distinguished by morphological and habitat characteristics (West 1984). D. viscosa has numerous medicinal properties, which have led to increased research interest in its antimicrobial (Getie et al. 2003; Thring et al. 2007), antibacterial (Khurram et al. 2009), wound-healing (Habbu et al. 2007), antioxidant (Mothana et al. 2010), and anti-inflammatory (Alagarsamy et al. 2007; Salinas-Sánchez et al. 2012) activities.

A typical chloroplast (cp) genome ranges between 72 and 217 kb in size and consists of a small single-copy (SSC) region and a large single-copy (LSC) region separated by a pair of inverted repeats (IRs) (Sugiura 1992; Tangphatsornruang et al. 2010). The complete cp genome has a highly conserved gene order and is of significant interest to researchers in plant evolution and systematics (Wicke et al. 2011). Despite the limitations of next-generation sequence assembly (Alkan et al. 2011), many chloroplast genome sequences have recently been documented. For instance, complete cp genomes have been used to determine evolutionary patterns using genes from plastid genomes (Jansen et al. 2007), for phylogenetic analysis of plastid genes (Moore et al. 2010), and to compare cp genomes representing angiosperm, gymnosperm, and fern families (Zhu et al. 2016). In addition, chloroplast DNA sequences have been used successfully to study the phylogenetics and phylogeography of angiosperms at lower taxonomic levels (Shaw et al. 2014).

Comparative cp genome analyses have been reported in several angiosperms, e.g. in two Dipteronia species (Zhou et al. 2016), Ampelopsis brevipedunculata (Raman and Park 2016) and six Rehmannia species (Zeng et al. 2017). However, despite rapidly advancing genome sequencing technology, few Sapindaceae genomes are available to date. Specifically, about eight complete chloroplast genomes have been reported across the family. Previously, transcriptome sequencing, annotation, and polymorphism detection were performed by Christmas et al. (2015), while Harrington and Gadek (2009) analyzed the phylogeny of the species using nuclear ribosomal ITS and ETSf sequences. However, to date there are no partial or complete chloroplast genome sequences for the genus Dodonaea, so genetic knowledge of this genus remains limited. This study aimed to sequence the cp genome of D. viscosa and compare it with eight other reported Sapindaceae cp genomes, to detect simple sequence repeats (SSRs), and to reconstruct its phylogenetic relationships with related species. The findings will be valuable for further studies of diversity, taxonomy, and phylogenetic relationships within Sapindaceae.

Methods

Plant materials and DNA isolation

Fresh leaves of D. viscosa were collected in Kenya (38°20′07″ E; 03°22′56″ S) and transported to Wuhan Botanical Garden, Chinese Academy of Sciences, for further analyses. Duplicate voucher specimens were deposited at the East Africa herbarium (EA) and the Wuhan Botanical Garden herbarium (HIB). Total genomic DNA was extracted from silica-dried leaves using a modified cetyltrimethylammonium bromide (CTAB) method as described by Doyle (1991).

Genome assembly and annotation

A paired-end library was constructed using TruSeq DNA sample preparation kits according to the manufacturer’s protocol (Illumina, San Diego, CA, USA), and whole-genome sequencing was then performed with an average read length of 150 bp (paired-end) on the Illumina HiSeq 2500 platform at the Novogene company (Beijing, China). Raw reads were filtered using PRINSEQ lite v0.20.4 (Schmieder and Edwards 2011) with the settings phredQ ≥ 20 and length ≥ 50, and plastid reads were extracted from the high-quality reads by mapping them to the available cp genome of Sapindus mukorossi (NC_025554) using BLASTn (E-value: 10^-6). The retained reads were assembled with Velvet 1.2.10 (Zerbino and Birney 2008) using k-mer lengths of 99–119. Six contigs (ranging from 7824 to 45,735 bp) were mapped against the reference chloroplast genomes (S. mukorossi, NC_025554; Dipteronia sinensis, NC_029338) using Geneious R8.0.2 (Kearse et al. 2012).
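
The read-filtering thresholds above (quality ≥ 20, length ≥ 50) can be illustrated with a minimal Python sketch. It assumes that the quality cutoff applies to the mean Phred score of a read and that qualities are Phred+33 encoded; neither detail is stated in the text, and the function names are hypothetical rather than part of PRINSEQ.

```python
def mean_phred(quality_string, offset=33):
    """Mean Phred score of a read, assuming Phred+33 encoding (an assumption)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores) if scores else 0.0


def filter_reads(records, min_mean_q=20, min_len=50):
    """Keep (name, seq, qual) records that pass both thresholds.

    Treating 'phredQ >= 20' as a mean-quality cutoff is an assumption.
    """
    kept = []
    for name, seq, qual in records:
        if len(seq) >= min_len and mean_phred(qual) >= min_mean_q:
            kept.append((name, seq, qual))
    return kept


# Toy reads: 'I' encodes Phred 40 and '#' encodes Phred 2 under Phred+33.
reads = [
    ("good", "A" * 60, "I" * 60),    # long and high quality: kept
    ("short", "A" * 30, "I" * 30),   # shorter than 50 bp: dropped
    ("lowq", "A" * 60, "#" * 60),    # mean quality below 20: dropped
]
kept_names = [name for name, _, _ in filter_reads(reads)]
print(kept_names)
```

Only the read named "good" survives both filters in this toy input.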

Annotation of the complete cp genome, including protein-coding genes, tRNAs, and rRNAs, was carried out using the Dual Organellar GenoMe Annotator (DOGMA) with default values as described by Wyman et al. (2004), with manual corrections of the start and stop codons and of the intron and exon boundaries. The positions of start and stop codons were further checked by BLAST homology searches. Similarly, the tRNA genes were verified with tRNAscan-SE 1.23 (http://lowelab.ucsc.edu/tRNAscan-SE/, Schattner et al. 2005). A circular gene map was drawn in OGDraw v1.2 (Lohse et al. 2007). The fully annotated cp genome was deposited in the GenBank database (accession no. MF155892).

Microsatellite detection analysis

SSR markers were detected using the Perl script MISA (Thiel et al. 2003), with minimum repeat numbers set at eight and five for mono- and di-nucleotides, respectively, and three for tetra-, penta-, and hexa-nucleotides.
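
The thresholds above can be illustrated with a small regular-expression search for perfect repeats. This is a sketch in Python, not the MISA script itself; the function names are hypothetical, and the minimum repeat number for trinucleotides (three) is an assumption, since the text does not state it.

```python
import re

# Minimum repeat counts per motif length. Values for 1, 2, 4, 5 and 6 follow
# the text; the value for 3 (trinucleotides) is NOT stated and is assumed.
MIN_REPEATS = {1: 8, 2: 5, 3: 3, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    n = len(motif)
    for k in range(1, n):
        if n % k == 0 and motif[:k] * (n // k) == motif:
            return False
    return True


def find_ssrs(sequence, min_repeats=MIN_REPEATS):
    """Return sorted (start, motif, repeat_count) tuples; start is 1-based."""
    sequence = sequence.upper()
    hits = []
    for size, min_rep in sorted(min_repeats.items()):
        # One motif of `size` bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start() + 1, motif, len(match.group(0)) // size))
    return sorted(hits)


# Toy sequence containing an A(10), an AT(6) and an AAG(4) repeat.
toy = "GC" + "A" * 10 + "GC" + "AT" * 6 + "GGC" + "AAG" * 4 + "C"
ssrs = find_ssrs(toy)
print(ssrs)
```

The primitive-motif check prevents a run such as AAAAAAAAAA from also being reported as an (AA)5 dinucleotide repeat. Compound and interrupted SSRs are outside the scope of this sketch.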

Comparative and phylogenetic analysis

The contraction/expansion regions of the inverted repeats (IRs) were compared among species in the Sapindaceae family. Phylogenetic analysis was performed using maximum likelihood (ML) in RAxML 8.0.20 (Stamatakis 2014). The best-fitting substitution model, GTR + I + G (p-inv = 0.47, gamma shape = 0.93), was selected based on the Akaike information criterion as implemented in jModelTest 2.1.7 (Posada 2008), and bootstrap analysis was performed with 1000 replicates. Bayesian inference (BI) was performed with MrBayes version 3.2.6 (Ronquist and Huelsenbeck 2003) with default parameters, running for 20,000,000 generations with sampling every 5000 generations. Chain heating was left at the default setting (temp = 0.2); the first 10% of trees were discarded as burn-in, and the remaining trees were combined to estimate posterior probabilities (PPs). Additionally, genome rearrangement analyses of four Sapindaceae species and one outgroup species were performed using Mauve (Darling et al. 2004), and dot plots and gene identity plots between D. viscosa and S. mukorossi were created using PipMaker (Schwartz et al. 2000).
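
As a quick worked check of the Bayesian sampling scheme described above, the following Python sketch counts the tree samples per run. It assumes the count excludes the starting tree at generation 0 and that the burn-in fraction is applied to the sampled trees of a single run; the function name is hypothetical.

```python
def mcmc_sample_counts(generations, sample_every, burnin_percent):
    """Return (total, discarded, retained) tree samples for one MCMC run."""
    total = generations // sample_every           # samples after generation 0
    discarded = total * burnin_percent // 100     # burn-in samples removed
    return total, discarded, total - discarded


total, discarded, retained = mcmc_sample_counts(20_000_000, 5_000, 10)
print(total, discarded, retained)  # 4000 400 3600
```

Under these assumptions each run yields 4000 sampled trees, of which 400 are discarded and 3600 contribute to the posterior probabilities.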

Synonymous (Ks) and non-synonymous (Ka) substitution rate analysis

The complete cp genomes of D. viscosa and its close relative S. mukorossi were compared. The 79 shared protein-coding genes were aligned using Geneious Basic 5.6.4 (Kearse et al. 2012), with S. mukorossi as the reference in the alignment. Synonymous (Ks) and nonsynonymous (Ka) substitution rates and their ratio, Ka/Ks, were then estimated using the Model Averaging method in the KaKs_Calculator program (Wang et al. 2011).

Results

Genome features

The cp genome of D. viscosa is a double-stranded DNA molecule with a length of 159,375 bp. Its LSC and SSC regions are 87,204 and 17,972 bp long, respectively, separated by a pair of inverted repeats (IRa and IRb) of 27,099 bp each. The overall GC content was 37.9%; the IR regions had a higher GC content (42.8%) than the SSC (32.0%) and LSC (36.0%) regions, probably because all four ribosomal RNA genes are duplicated in the IRs. Genome annotation revealed 81 protein-coding genes, 30 tRNA genes, and 4 rRNA genes (Table 1). Nine protein-coding genes (rpl22, rps19, rpl2, rpl23, ycf2, ycf15, ndhB, rps7, and rps12) were duplicated (Fig. 1). The LSC contained 60 protein-coding genes and 23 tRNAs, while the SSC comprised 12 protein-coding genes and one tRNA, with the ycf1 gene located at the IRa/SSC boundary. Nine protein-coding genes, seven tRNAs, and all the rRNAs were repeated in the IR regions. Seventeen genes contained one intron, while three genes (ycf3, rps12, and clpP) harbored two (Table 1).
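
The regional GC values can be checked against the overall figure with a length-weighted average. The Python sketch below uses only the region lengths and GC percentages given in the paragraph above; because those percentages are rounded, it is a consistency check rather than an exact recomputation, and the function name is hypothetical.

```python
# (length in bp, GC content in %) for each region, as reported in the text.
regions = {
    "LSC": (87_204, 36.0),
    "SSC": (17_972, 32.0),
    "IRa": (27_099, 42.8),
    "IRb": (27_099, 42.8),
}


def weighted_gc(region_table):
    """Length-weighted mean GC content (%) across regions."""
    total_len = sum(length for length, _ in region_table.values())
    gc_bases = sum(length * gc / 100 for length, gc in region_table.values())
    return 100 * gc_bases / total_len


overall_gc = weighted_gc(regions)
print(round(overall_gc, 1))  # 37.9
```

The weighted mean rounds to 37.9%, in agreement with the reported overall GC content.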

Table 1 List of genes encoded by Dodonaea viscosa chloroplast genome
Fig. 1

Gene map of the Dodonaea viscosa chloroplast genome. Genes lying outside of the circle are transcribed clockwise, whereas genes inside the circle are transcribed counterclockwise. The colored bars indicate different functional groups. The dark gray area in the inner circle corresponds to GC content, while the light gray corresponds to the AT content of the genome. IR, inverted repeat; LSC, large single copy; SSC, small single copy

Cp genome comparison to other related Sapindaceae species

The cp genome of D. viscosa was similar to those of other Sapindaceae members in gene content and organization. A comparative analysis of the family revealed that the D. viscosa genome was 2,780 bp larger than the smallest genome. The total length of the cp genome sequences of the eight species ranged from 156,595 to 160,481 bp. All species had a quadripartite structure, with the LSC region ranging between 85,227 and 87,204 bp, the IR regions between 26,100 and 27,979 bp, and the SSC region between 17,972 and 18,873 bp (Table 2). The average GC content across the available species was approximately 37.8%. The overall chloroplast genome structure and the number and order of genes appear to be well conserved among the close relatives of D. viscosa (Table 2). Additionally, the alignment analysis revealed that the cp genomes of A. davidii, D. sinensis, D. viscosa, B. sacra (outgroup) and S. mukorossi were relatively conserved, with no rearrangement in gene organization (Fig. 2). Comparisons of the D. viscosa and S. mukorossi cp genomes based on dot plots and identity plots showed conserved genome synteny (Fig. 3); however, the percentage identity between the two species was relatively low in a region corresponding to the psaB and psaA genes (> 50% identity), in contrast with the surrounding genic regions (Fig. 4).

Table 2 Comparison of chloroplast genomes of Sapindaceae species
Fig. 2

MAUVE alignment of the chloroplast genomes of four Sapindaceae species and Boswellia sacra (outgroup). Within each alignment, locally collinear blocks are represented by blocks of the same color connected by lines

Fig. 3

Dot plot of chloroplast genome sequences shared by D. viscosa and S. mukorossi

Fig. 4

The percentage identities in gene identity plots of the genomic regions in D. viscosa and S. mukorossi

Microsatellite detection analysis

In this study, a total of 249 SSR motifs were discovered in the D. viscosa cp genome (Table 3). Mononucleotide repeats were the most abundant (64.65%), followed by trinucleotide repeats (29.72%). Di- and tetra-nucleotide repeats were less frequent, at 2.81% each, while penta- and hexa-nucleotide repeats were not detected. A and T repeat units were the most common mononucleotides, accounting for 28.91 and 32.13%, respectively. Among the trinucleotides, the AAG/CTT and AAT/ATT motifs were the most abundant, accounting for 12.05 and 8.83%, respectively. Only two dinucleotide motifs, AG/CT and AT/AT, occurred, at lower frequencies of 0.80 and 0.20%, respectively. Most tetranucleotide SSRs had the AGAT/ATCT motif, followed by the AAAG/CTTT, AAAT/ATTT, and AACT/AGTT motifs, each of which occurred at a similarly low frequency (0.4%).
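
The class percentages above can be converted back to approximate SSR counts as a sanity check. The Python sketch below assumes each percentage was computed from the total of 249 and rounded to two decimals; the resulting counts are back-calculated here and are not taken from Table 3.

```python
TOTAL_SSRS = 249

# Percentage of SSRs in each repeat class, as reported in the text.
percent_by_type = {"mono": 64.65, "di": 2.81, "tri": 29.72, "tetra": 2.81}

# Back-calculated counts (an assumption: percentages are rounded shares of 249).
counts = {
    repeat_type: round(TOTAL_SSRS * percent / 100)
    for repeat_type, percent in percent_by_type.items()
}
print(counts, sum(counts.values()))
```

Under this assumption the four classes correspond to 161, 7, 74 and 7 SSRs, which together account for all 249 motifs.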

Table 3 Simple sequence repeats (SSRs) in the Dodonaea viscosa chloroplast genome

In all the genomes, mononucleotide repeats were the most abundant repeat type, followed by trinucleotides (Fig. 5). Penta- and hexa-nucleotide repeats were not detected in most of the genomes and, where present, occurred at low frequencies. In particular, pentanucleotide repeats were observed only in S. mukorossi, while hexanucleotide repeats were present in both Acer buergerianum and S. mukorossi. Moreover, all the cp genomes shared twelve SSR motifs (Fig. 5).

Fig. 5

Simple sequence repeats (SSRs) in the nine Sapindaceae chloroplast genomes

Comparison of IR and single-copy (SC) regions

The LSC/IRb/SSC/IRa junction regions of D. viscosa were compared with the corresponding regions of closely related cp genomes (Fig. 6). The duplicated rpl22 gene was detected at the LSC/IRa and LSC/IRb junctions of seven genomes, whereas it was located in the IRa region in S. mukorossi and in the LSC in Acer miaotaiense, 754 and 302 bp, respectively, from the LSC/IRa border. The trnH gene was found in the LSC region of all cp genomes, 0 to 135 bp from the IRb/LSC border. Complete duplication of the rps19 gene was observed in the IRa and IRb regions, 155–427 bp from the LSC/IRa and LSC/IRb boundaries; however, this gene was located at the LSC/IRa junction in A. miaotaiense (Fig. 6). The ycf1 gene located at the IRb/SSC junction varied in size from 5477 bp (D. dyeriana) to 5642 bp (S. mukorossi) across the cp genomes. The ycf1 pseudogene crossed the IRa/SSC junction in six genomes, whereas it was located in the IRa region in D. dyeriana, A. morrisonense and A. davidii. Additionally, at the IRa/SSC junction the ycf1 pseudogene and the ndhF gene overlapped in all eight genomes except D. dyeriana; moreover, ndhF was shifted into the SSC region in S. mukorossi, A. griseum and D. dyeriana, with gaps of 29, 17 and 4 bp, respectively (Fig. 6).
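
The border descriptions above (a gene spanning a junction, or lying some number of bp away from it) can be expressed as a simple interval test. The Python sketch below is illustrative only: the function name and all coordinates are hypothetical and are not taken from the annotated genomes.

```python
def locate_gene(gene_start, gene_end, junction):
    """Describe a gene's position relative to a junction coordinate.

    Coordinates are 1-based and inclusive. `junction` is the last base of the
    upstream region, so the downstream region begins at junction + 1.
    Returns (label, bp): the gap to the border, or for a spanning gene the
    number of its bases lying in the upstream region.
    """
    if gene_end <= junction:
        return ("upstream", junction - gene_end)
    if gene_start > junction:
        return ("downstream", gene_start - junction - 1)
    return ("spans", junction - gene_start + 1)


# Hypothetical junction at position 1000 with three hypothetical genes.
print(locate_gene(900, 950, 1000))    # ('upstream', 50)
print(locate_gene(1030, 1200, 1000))  # ('downstream', 29)
print(locate_gene(800, 1100, 1000))   # ('spans', 201)
```

A gene reported as shifted into a region with a gap of 29 bp, for example, corresponds to the "downstream" case with a returned distance of 29.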

Fig. 6

Comparison of IR, LSC and SSC border regions among nine Sapindaceae cp genomes

Phylogenetic analysis

Phylogenetic analysis was performed using complete chloroplast genome sequences from eleven species, including D. viscosa and two outgroup species (Fig. 7). The reconstructed phylogenetic tree placed the Sapindales taxa into two groups. D. viscosa and S. mukorossi clustered together with a strong bootstrap value of 100%. All the Acer and Dipteronia species formed another group with strong bootstrap support.

Fig. 7

Maximum likelihood phylogenetic relationships based on complete chloroplast genome sequences of nine Sapindaceae species and two outgroup species (Spondias tuberosa and Boswellia sacra)

Synonymous (Ks) and non-synonymous (Ka) substitution rate analysis

To estimate selection pressures on chloroplast genes, 79 protein-coding genes in the cp genome of D. viscosa were used to compute synonymous and non-synonymous substitution rates against S. mukorossi (Table S1). Ka/Ks ratios reported as not applicable (NA) were set to 0 (Table S1). Among the shared protein-coding genes, all except rps14 had a Ka/Ks ratio significantly less than 1 (Fig. 8). A comparison among the genes in each functional group showed that the substitution rates fluctuated widely among the 79 coding genes, with Ka and Ks ranging from 0 to 1.0209 and from 0 to 1.0365, respectively (Table S1). The genes related to photosystem I had the highest synonymous rate, with an average of 0.2338, while those related to ATP synthase showed the lowest average rate of 0.1257. Genes associated with the small subunit of the ribosome showed the highest average non-synonymous (Ka) rate of 0.1065, while those associated with photosystem II had the lowest average value of 0.0112. Based on the Ka/Ks values, 67 genes showed evidence of purifying selection, while one gene, rps14, located in the LSC region, had a Ka/Ks ratio above 1.0, which indicates positive selection. No synonymous or nonsynonymous changes were observed in the rps19, rps7, psaC, psbD, psbL, psbM, psbN, petN, atpH, rpl23, and psbI genes.
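
The interpretation of Ka/Ks values used above can be summarized in a short Python sketch. The thresholds follow the conventional reading (ratio below 1 for purifying selection, above 1 for positive selection), and an undefined ratio is set to 0 as described in the text. The gene names and rates in the example are hypothetical, and no significance testing is included.

```python
def ka_ks_ratio(ka, ks):
    """Ka/Ks, with an undefined ratio (Ks == 0) reported as 0, as in the text."""
    return ka / ks if ks > 0 else 0.0


def classify(ka, ks):
    """Label a gene by its Ka/Ks ratio (conventional interpretation)."""
    if ka == 0 and ks == 0:
        return "no substitutions"
    if ks == 0:
        return "undefined (reported as 0)"
    ratio = ka_ks_ratio(ka, ks)
    if ratio > 1:
        return "positive selection"
    if ratio < 1:
        return "purifying selection"
    return "neutral"


# Hypothetical (Ka, Ks) pairs for illustration only; not values from Table S1.
example_rates = {"geneA": (0.01, 0.20), "geneB": (0.30, 0.20), "geneC": (0.0, 0.0)}
labels = {gene: classify(ka, ks) for gene, (ka, ks) in example_rates.items()}
print(labels)
```

Genes with no substitutions at all are kept as a separate class here, mirroring the eleven genes for which no changes were observed.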

Fig. 8

Comparison of nonsynonymous (Ka) and synonymous (Ks) substitution rates and Ka/Ks ratio between D. viscosa and S. mukorossi

Discussion

The length of the D. viscosa cp genome reported here (159,375 bp) is similar to, but 1106 bp smaller than, that of the closely related species S. mukorossi (160,481 bp) (Yang et al. 2016). Moreover, the genome organization, gene order and GC content of the two cp genomes are similar and conserved. The D. viscosa cp genome is AT-rich, with an overall A + T content of 62.14%, which is consistent with other species of the Sapindaceae family, for example Acer miaotaiense (62.12%) (Zhang et al. 2016) and A. davidii (62.10%) (Jia et al. 2016), and with other species in the order Sapindales, such as Citrus sinensis (61.52%) (Bausher et al. 2006) and Phellodendron amurense (61.60%) (Chen et al. 2017).

Among the 81 protein-coding genes, unusual start codons were observed in three genes (rps19, rps12 and ndhD), which began with GTG, ACT, and ACG, respectively. Previous studies have reported that RNA editing may restore the translation initiation codons ACG and GUG to the standard start codon AUG (Kuroda et al. 2007; Takenaka et al. 2013); hence the same process might occur in the genes reported here. The SSR motifs present in the chloroplast genome provide a useful source for designing primers for studies of phylogeography and population structure at the species level, and useful genomic information for further work on population genetics and phylogenetic relationships in this genus.

Variation in the IR and single-copy (SC) regions has been considered a main cause of length variation in angiosperm cp genomes (Kim and Lee 2004). In this study, nine Sapindaceae species were used to compare the exact IR border positions and their adjacent genes (Fig. 6). The LSC, IR and SSC regions of D. viscosa differed slightly in size from those of S. mukorossi, a closely related member of the same family. The IRb/SSC boundary extended into the ycf1 gene, leading to the formation of a ycf1 pseudogene at the IRa/SSC boundary with lengths of 951–1400 bp; however, this pseudogene was located in the IRa region in A. davidii, A. morrisonense, and D. dyeriana. In addition, the ndhF gene overlapped with the ycf1 pseudogene in most of the cp genomes but not in S. mukorossi and D. dyeriana.

As shown in Fig. 6, minor shifts occurred at the IR/LSC borders. For example, rpl22 crossed the LSC/IRa border in D. viscosa, A. griseum, A. davidii, A. buergerianum, A. morrisonense and the two Dipteronia species, with a pseudogene fragment duplicated at the IRb/LSC junction; however, it was located in the IRa region in S. mukorossi and in the LSC in A. miaotaiense. Remarkably, the SSC region of D. viscosa was smaller than those of the other species, while the trnH-GUG gene was located in the LSC region of all the genomes. Recent studies have shown that the rps19 gene is generally located in the IR regions of other angiosperm cp genomes (Li et al. 2017); our results likewise showed that this gene is located in the IR region of all eight species except A. miaotaiense. The duplication of these genes is possibly due to IR expansion, which has been reported in other Sapindales (Bausher et al. 2006).

The phylogenetic tree showed a close relationship between D. viscosa and S. mukorossi, which formed a strongly supported group together with all the Acer and Dipteronia species (subfamily Aceroideae), with a maximum (100%) bootstrap value (Fig. 7). This observation is in line with previous studies in which all the Acer taxa clustered into a monophyletic clade (Zhang et al. 2016; Wang et al. 2017). Similarly, the Dipteronia species were closely related to S. mukorossi and D. viscosa, as verified in our results with a strong bootstrap value (Fig. 7). Both the ML bootstrap support (BS) and the BI posterior probabilities (PP) strongly supported all the nodes, with values of 100% and 0.9, respectively. All the nodes in the tree had high bootstrap support (data not shown).

The selection pressure analysis showed that most genes had a Ka/Ks value less than 1, indicating purifying selection (Fig. 8). Recent studies (Machado et al. 2017; Raman and Park 2016) revealed that the Ka/Ks ratio favored purifying selection in many genes, which is consistent with our study. The average Ks values between the two closely related species were 0.1770, 0.0525 and 0.2191 for the LSC, IR and SSC regions, respectively, with an average Ks of 0.1662 across all regions (Table S1). Higher Ks values were observed in most of the genes across all the regions; genes with lower Ks values included ycf3, rps11, rps4, rps2, rpl36, rpl33, clpP, atpF, psaA, petN, rbcL, psbT, psbL, psbJ, psbI, psbF and psbE (LSC), only rps15 (SSC), and all the genes in the IR region except ycf1 and rpl22 (Table S1). The distribution of Ks indicated that most genes in the SSC region have experienced higher selection pressures than those in the other cp genome regions, whereas the IR region is more conserved. Fu et al. (2016) found a higher evolutionary rate in the SSC than in the LSC and IR, which agrees with the results of this study. The finding that the IR region is more conserved than the LSC and SSC regions is likewise consistent with earlier reports that the IR diverges at a slower rate than the LSC and SSC regions (Cho et al. 2015; Fu et al. 2016).

This study has generated the complete cp genome sequence of D. viscosa, the first cp genome reported for the subfamily Dodonaeoideae. Comparative analyses among species of Sapindaceae revealed that the available cp genomes in this family are well conserved in overall structure. The phylogenetic analysis showed that D. viscosa and S. mukorossi are closely related. This novel D. viscosa cp genome could be a valuable resource that helps to fill the gap in knowledge of genome evolution in the subfamily Dodonaeoideae and related species.