Introduction

Approximately one-third of the Earth’s land surface is desert. In China, the desert regions occupy 22% of total land area (Ge et al. 2005). With the ongoing global warming, desertification has been—and will continue to be—a global issue that threatens human lives and activities worldwide (Wang and Liu 2013). Conservation of the genetic resources of endemic desert plants is an essential part of global efforts to curb desertification. Ammopiptanthus, the only genus with an evergreen broad-leaf habit in the desert and arid regions of Central Asia, plays an important role in maintaining desert ecosystems. Understanding the molecular mechanism of adaptation to desert environments in plants in the genus Ammopiptanthus would be helpful in protecting the environment of the desert regions.

The genus Ammopiptanthus (Leguminosae) comprises two species with high levels of morphological similarity: A. mongolicus (Maxim. ex Kom.) Cheng f. and A. nanus (M. Pop.) Cheng f. Both A. nanus and A. mongolicus grow in dry desert areas and are narrowly distributed species. The habitats of these two Ammopiptanthus plants are located in regions with the same latitude (39–40′), but approximately 2500 km apart. Both habitats are stony and/or sandy deserts where the annual precipitation ranges from 100 to 160 mm, and have similar annual average temperatures (7–8 °C). However, their habitats have different annual average sunshine (A. mongolicus: 3300–3800 h; A. nanus: 2400–2600 h) and different altitude levels (A. mongolicus: 1000–1200 m; A. nanus: 2100–2700 m). The two Ammopiptanthus plants are relic trees of the Tertiary period. According to fossil evidence, Central Asia, the habitat of the ancestral species of genus Ammopiptanthus, has undergone dramatic climate change over time. As the weather became drier and colder, the overwhelming majority of broad-leaf trees became extinct, and the distribution range of the genus Ammopiptanthus became smaller, probably resulting in habitat fragmentation that may have promoted the speciation of the two Ammopiptanthus species. Thus, comparative analysis between the two Ammopiptanthus species will facilitate our understanding of differential habitat adaptation and speciation.

Comparison of the transcriptomes of closely related species promotes the understanding of how genomic variation translates into morphological variation and helps to identify the associated selective pressures. The advent of RNA-Seq (whole-transcriptome shotgun sequencing using next generation sequencing technology) has opened the door to unprecedented large-scale and cross-species comparative transcriptome analyses (Wang et al. 2009; Necsulea and Kaessmann 2014). RNA-Seq allows for the comparison of the variation in the coding sequences (CDS) of ortholog pairs expressed in different species, subsequently leading to the identification of the rapidly evolved genes and providing clues to explain the evolution of divergent phenotypes. Such a strategy has been successfully adopted in the comparative analysis of frogs (Yang et al. 2012), lizards (Yang et al. 2014, 2015a), primroses (Zhang et al. 2013a), holly mangrove (Yang et al. 2015b), poplars (Zhang et al. 2013b), and ramies (Cheng et al. 2015).

Owing to the high academic value and ecological importance of the Ammopiptanthus plants, many studies have been conducted on their physiological and anatomical traits (Liu and Qiu 1982), drought resistance mechanisms (Wang et al. 2007; Xu et al. 2002), characterization of the putative stress tolerance-related genes (Chen et al. 2011; Wei et al. 2011, 2012a, b), and analysis of gene and miRNA profiling under drought and cold stress conditions using next generation sequencing (Zhou et al. 2012; Gao et al. 2015, 2016; Wu et al. 2014; Pang et al. 2013). However, almost all of these studies were carried out in A. mongolicus. Few studies were performed in A. nanus, despite the ecological significance of A. nanus in Central Asia. In addition, although the genetic diversity and geographic differentiation between the two Ammopiptanthus species and within each species were investigated in a previous study (Ge et al. 2005), a comparative study at the transcriptome level has not yet been reported, possibly due to the lack of A. nanus nucleotide data available in the public databases.

In the present study, we established a high-quality transcriptome dataset for A. nanus using the Illumina sequencing platform, and a large number of protein-coding genes and miRNA precursors were annotated from the assembled transcripts. A batch of orthologous genes that might be under positive selection were identified by comparing their transcriptomes. These results provide a comprehensive genomic resource for functional genomics research on A. nanus in the future, and enrich the current knowledge about the origin of the two Ammopiptanthus species and their adaptive evolution to their habitats in Central Asia.

Materials and methods

Ethics statement

The A. nanus seed collection and research activities were scientifically conducted under permits issued by Wuqia Forestry Bureau. The experimental procedures were approved by the Ethics Committee for Plant Experiments of Minzu University of China and the State Forestry Administration, China.

Plant materials

Ammopiptanthus nanus seeds, collected from the desert region in Wuqia County, Xinjiang Autonomous Region, China, were surface-sterilized with ethanol and soaked in water for 48 h at 25 °C. The surface-sterilized seeds were sown in commercial pots (9-cm diameter) containing vermiculite and perlite (1:1, w/w), in a greenhouse at approximately 25 °C and 35% relative humidity under a photosynthetic photon flux density of 120 μmol m−2 s−1 with a photoperiod of 16 h light and 8 h dark. The seedlings were watered at 3-day intervals with half-strength Hoagland’s solution. Two to three weeks after germination, roots and leaves of the seedlings were harvested and used for Illumina transcriptome sequencing.

Transcriptome sequencing

Total RNA was extracted using TRIzol reagent (Invitrogen, Burlington, Canada) according to the manufacturer’s protocol. The quality and purity of RNA samples were assessed using the RNA 6000 Nano LabChip kit and a Bioanalyzer 2100 (Agilent Technologies, Santa Clara, USA) with the RNA integrity number (RIN) > 7.0. Magnetic beads with Oligo (dT) (Invitrogen, Burlington, Canada) were used to isolate mRNA from 10 µg total RNA. Following purification, the mRNA was cleaved into short fragments with divalent cations under elevated temperature. Then, the cleaved RNA fragments were reverse transcribed to create the final cDNA library in accordance with the protocol for the RNA-Seq sample preparation kit (Illumina, San Diego, USA). The average insert size for the paired-end libraries was 300 bp (± 50 bp). We performed the paired-end sequencing on an Illumina HiSeq 2000 (LC Sciences, USA) following the vendor’s protocol.

De novo transcriptome assembly and annotation

The sequenced raw reads were subjected to a quality check using FastQC (http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) and then cleaned by removing reads with adaptor sequences, reads with more than 5% unknown nucleotides, and low quality reads (reads in which more than 15% of bases had a Phred quality score of ≤ 20). Sequencing reads were de novo assembled using Trinity software (Grabherr et al. 2011) under default parameters and with a k-mer size of 25. Assembly quality was assessed by length distribution analysis with custom Perl scripts. The N50, average length, maximum length, and number of contigs in different length intervals were all calculated. Contigs shorter than 200 bp were discarded from all assemblies. The assembled transcriptome sequences were named assembled transcripts.
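The two content-based read filters and the N50 statistic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `passes_filters` and `n50` are hypothetical, adaptor removal is not shown, and the original analysis used custom Perl scripts whose details are not given here.

```python
from typing import List


def passes_filters(seq: str, quals: List[int],
                   max_n_frac: float = 0.05,
                   max_lowq_frac: float = 0.15,
                   lowq_phred: int = 20) -> bool:
    """Return True if a read survives both filters.

    A read is rejected when more than 5% of its bases are unknown (N),
    or when more than 15% of its bases have a Phred score of <= 20.
    """
    if not seq or len(seq) != len(quals):
        return False
    n_frac = seq.upper().count("N") / len(seq)
    lowq_frac = sum(q <= lowq_phred for q in quals) / len(quals)
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac


def n50(lengths: List[int], min_len: int = 200) -> int:
    """N50 of the contigs that are at least min_len long (0 if none remain)."""
    kept = sorted((n for n in lengths if n >= min_len), reverse=True)
    half_total = sum(kept) / 2
    running = 0
    for length in kept:
        running += length
        if running >= half_total:
            return length
    return 0


if __name__ == "__main__":
    print(passes_filters("ACGT" * 25, [30] * 100))
    print(n50([100, 200, 300, 400, 500]))
```

The N50 is the length of the contig at which the cumulative length, counted from the longest contig downwards, first reaches half of the total assembly length; contigs below the 200 bp cut-off are dropped before the calculation, mirroring the text.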

All assembled transcripts were compared with the NCBI Nr, Swiss-Prot, Pfam, KEGG, COG, and GO databases (Ashburner et al. 2000) using BLASTX with a typical cut-off e-value of 1e-5 to search for homologs. The putative CDS for each transcript were predicted using GENSCAN software (Burge and Karlin 1997). The putative transcription factor genes were identified by aligning the assembled transcripts to the peptide sequences of Glycine max TFs in PlantTFDB (http://planttfdb.cbi.pku.edu.cn/). A BLASTP search was performed, and an e-value of 1e-5 was used as the threshold. For miRNA precursor prediction, the assembled transcripts were aligned to the stem-loop precursors of miRNAs in A. mongolicus and G. max with a cut-off e-value of 1e-10. Qualified secondary structures of miRNA precursors were determined according to the criteria defined in miRBase (Kozomara and Griffiths-Jones 2011).

Assessment of transcriptome assembly

Three assessment tools were used to estimate the quality and completeness of our transcriptome assembly. First, Benchmarking Universal Single Copy Orthologs (BUSCO) v2 (Simão et al. 2015) was used to identify universal single copy orthologs (USCOs) in our transcriptome assembly, as a measure of completeness. BUSCO analysis was performed using the plant dataset (embryophyta_odb9, creation date: 2016-11-01). Second, we used TBLASTX to query the list of 357 eukaryotic UCO protein sequences from Arabidopsis (http://compgenomics.ucdavis.edu/compositae_reference.php) (Kozik et al. 2008) with an e-value threshold of 1e-10. BLAST results were parsed to determine the number of A. nanus transcripts that showed a positive hit to the UCO sequences with amino acid alignments of at least 30 residues. Third, the transcriptome coverage was assessed by comparing the A. nanus transcripts with the PlantTribes database (Wall et al. 2008). In this analysis, 959 shared single copy tribes from Arabidopsis thaliana, Populus trichocarpa, Vitis vinifera, and Oryza sativa (APVO) (Duarte et al. 2010) were compared with the A. nanus transcripts using TBLASTX and an e-value cut-off of 1e-6.
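Parsing BLAST results against an e-value threshold and a minimum alignment length, as in the UCO analysis above, could look like the following sketch. It assumes the results were saved in the standard 12-column tab-separated BLAST+ tabular format, which the text does not state; the function name `covered_queries` and the sample identifiers are hypothetical.

```python
from typing import Iterable, Set


def covered_queries(rows: Iterable[str],
                    max_evalue: float = 1e-10,
                    min_aln_len: int = 30) -> Set[str]:
    """Return the query IDs that have at least one qualifying hit.

    Columns follow the default tabular layout: query, subject, percent
    identity, alignment length, mismatches, gap opens, query start/end,
    subject start/end, e-value, bit score.
    """
    covered = set()
    for row in rows:
        if not row.strip() or row.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = row.rstrip("\n").split("\t")
        query = fields[0]
        aln_len = int(fields[3])    # in residues for translated searches
        evalue = float(fields[10])
        if evalue <= max_evalue and aln_len >= min_aln_len:
            covered.add(query)
    return covered


if __name__ == "__main__":
    sample = [
        "UCO1\tT1\t80.0\t45\t9\t0\t1\t135\t1\t135\t1e-20\t95.0",
        "UCO2\tT2\t70.0\t25\t7\t0\t1\t75\t1\t75\t1e-15\t60.0",   # too short
        "UCO3\tT3\t60.0\t50\t20\t0\t1\t150\t1\t150\t1e-5\t40.0",  # weak e-value
    ]
    print(sorted(covered_queries(sample)))
```

Dividing the number of covered queries by the size of the reference set (357 UCOs, or 959 single copy tribes with the 1e-6 cut-off) gives the coverage percentages reported in the Results.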

EST-SSR marker identification

MISA (http://pgrc.ipk-gatersleben.de/misa/) was used to identify potential simple sequence repeat markers derived from expressed sequence tags (EST-SSRs) in all unique sequences. Dinucleotide motifs repeated more than six times, and trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide motifs repeated more than five times, were used as the search criteria for simple sequence repeats (SSRs) in the MISA script.
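As a rough illustration of this kind of search, the sketch below finds perfect SSRs with regular expressions. It is not the MISA script. The thresholds are treated here as minimum repeat counts (six for dinucleotides, five for longer motifs), which is one reading of the criteria above and should be taken as an assumption; the names `MIN_REPEATS`, `is_compound`, and `find_ssrs` are hypothetical.

```python
import re
from typing import List, Tuple

# Assumed minimum repeat counts per motif length (see lead-in).
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_compound(motif: str) -> bool:
    """True if the motif is itself a repeat of a shorter unit (e.g. AGAG, AAA)."""
    n = len(motif)
    return any(n % d == 0 and motif == motif[:d] * (n // d)
               for d in range(1, n))


def find_ssrs(seq: str) -> List[Tuple[int, str, int]]:
    """Return (start, motif, repeat_count) for each perfect SSR in seq."""
    seq = seq.upper()
    hits = []
    for unit, min_rep in MIN_REPEATS.items():
        # A motif of the given length followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_compound(motif):
                continue  # already reported under the shorter motif, or a homopolymer
            hits.append((match.start(), motif, len(match.group(0)) // unit))
    return sorted(hits)


if __name__ == "__main__":
    print(find_ssrs("TTT" + "AG" * 6 + "CCC"))
```

Skipping compound motifs avoids counting one locus twice, for example a long AG repeat being reported again as an AGAG tetranucleotide repeat.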

Identification of orthologous contigs and estimation of substitution rates

We employed the reciprocal best-hit method in BLASTN to identify potentially orthologous sequences between A. nanus and A. mongolicus (Altenhoff and Dessimoz 2009; Zhang et al. 2013a). Pairs of transcripts that were each other’s best hit (e-value < 1e-10) and at least 200 bp long were retained. To obtain pairs of orthologous genes with higher confidence, the putative pairs of orthologous unigenes were further aligned against the G. max protein dataset. If the two transcripts in a putative ortholog pair mapped to different proteins via BLASTX, the pair was removed. The transcriptome sequences of A. mongolicus used for ortholog pair screening were assembled in a previous study (Gao et al. 2016).
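The reciprocal best-hit step can be sketched as follows. This is a minimal illustration, assuming that each BLASTN hit has already been reduced to a tuple of query, subject, e-value, alignment length, and bit score, that the best hit is the one with the highest bit score, and that the 200 bp criterion applies to the alignment length. None of these details is specified above, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

# (query, subject, e-value, alignment length, bit score)
Hit = Tuple[str, str, float, int, float]


def best_hits(hits: Iterable[Hit],
              max_evalue: float = 1e-10,
              min_len: int = 200) -> Dict[str, str]:
    """Map each query to its highest-scoring subject among qualifying hits."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, evalue, aln_len, bitscore in hits:
        if evalue >= max_evalue or aln_len < min_len:
            continue
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Iterable[Hit],
                         b_vs_a: Iterable[Hit]) -> List[Tuple[str, str]]:
    """Pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


if __name__ == "__main__":
    nanus_vs_mong = [("n1", "m1", 1e-50, 500, 900.0),
                     ("n2", "m2", 1e-40, 300, 600.0)]
    mong_vs_nanus = [("m1", "n1", 1e-50, 500, 900.0),
                     ("m2", "n1", 1e-30, 400, 500.0)]
    print(reciprocal_best_hits(nanus_vs_mong, mong_vs_nanus))
```

The additional check against the G. max proteins would then keep only those pairs whose two members share the same best BLASTX target.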

To identify genes undergoing selection, we estimated the rates of nonsynonymous and synonymous substitution between A. nanus and A. mongolicus. The KaKs_Calculator was employed to estimate Ka, Ks, and Ka/Ks using the YN method (Zhang et al. 2006). A Ka/Ks ratio of 0.5 was used as the cut-off to identify genes under positive selection and ortholog pairs with Ks > 0.1 were excluded to avoid potential paralogs (Elmer et al. 2010). When genes with Ka/Ks > 0.5 were found, their sequences were checked manually to guarantee their accuracy. For validation, five positively selected genes (PSGs) across two species were randomly selected and their CDS were sequenced to confirm the corresponding mutation sites. Divergence in the CDS sites between A. nanus and A. mongolicus was calculated using the K2P model (Graur and Li 2000; Zhang et al. 2013a).
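The selection filter and the K2P divergence used above can be written compactly. The sketch below is illustrative only: it assumes Ka and Ks have already been estimated (for example by KaKs_Calculator), it does not apply the P value criterion mentioned in the Results, and the function names are hypothetical. The K2P distance follows Kimura's two-parameter formula, d = −½ ln[(1 − 2P − Q)·√(1 − 2Q)], where P and Q are the proportions of transitions and transversions.

```python
import math

PURINES = {"A", "G"}


def is_candidate_psg(ka: float, ks: float,
                     ka_ks_cutoff: float = 0.5,
                     ks_max: float = 0.1) -> bool:
    """Apply the cut-offs from the text: Ks <= 0.1 and Ka/Ks > 0.5."""
    if ks <= 0 or ks > ks_max:
        return False  # Ka/Ks undefined, or pair excluded as a potential paralog
    return ka / ks > ka_ks_cutoff


def k2p_distance(seq_a: str, seq_b: str) -> float:
    """Kimura two-parameter distance between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only columns where both sequences have an unambiguous base.
    pairs = [(x, y) for x, y in zip(seq_a.upper(), seq_b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        raise ValueError("no comparable sites")
    n = len(pairs)
    transitions = sum(1 for x, y in pairs
                      if x != y and (x in PURINES) == (y in PURINES))
    transversions = sum(1 for x, y in pairs
                        if x != y and (x in PURINES) != (y in PURINES))
    p, q = transitions / n, transversions / n
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


if __name__ == "__main__":
    print(is_candidate_psg(ka=0.02, ks=0.03))
    print(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"))
```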

Results

Sequencing and assembly of the A. nanus transcriptome

To obtain a high-coverage de novo transcriptome assembly of A. nanus, we constructed two cDNA libraries from the leaves and roots of A. nanus seedlings and performed transcriptome sequencing using the Illumina HiSeq2000 platform. In total, approximately 41 and 53 million paired-end 125 bp reads were generated from leaves and roots, respectively (Table 1). All read data were deposited in the National Center for Biotechnology Information (NCBI) and can be accessed in the Short Read Archive (SRA) under the accession number SRR2886792. After trimming adapters and filtering out low quality reads, more than 93 million reads were obtained in total. These reads were assembled, using Trinity software, into 69,990 transcripts with a mean size of 1353 bp (Table 2). The size distribution of the assembled transcripts is shown in Fig. 1.

Table 1 Summary of the A. nanus raw sequencing data
Table 2 Summary of the assembled A. nanus transcripts
Fig. 1 Length distribution of the assembled transcripts

Functional annotation of the A. nanus transcriptome sequences

For functional annotation, the assembled transcripts were aligned against the Nr (non-redundant protein sequences in NCBI), Swiss-Prot, Pfam, Kyoto encyclopedia of genes and genomes database (KEGG), clusters of orthologous groups of proteins (COG), and gene ontology (GO) databases using the BLASTX algorithm. A typical cut-off value of e < 1e-5 was used. As a result, 54,916 assembled transcript sequences (78.46%) were successfully annotated by at least one database (Table 3). The mapping rates of the assembled transcripts against the Swiss-Prot and Nr protein databases were 50.38 and 78.07%, respectively. The results indicated that the assembly represented a substantial portion of the entire A. nanus transcriptome, and the majority of the assembled transcripts were protein-coding genes.

Table 3 Statistics of the annotation results of the assembled transcripts

The annotation rate of the assembled transcripts (the number of annotated transcripts vs. the total number of assembled transcripts) was positively correlated with the length of the assembled transcripts. More than 92% of the assembled transcripts over 700 bp in length had homology matches in the Nr database, whereas less than 50% of the assembled transcripts shorter than 400 bp had significant matches (Fig. 2). The low percentage of BLAST hits for the short assembled transcripts may be partially due to the lack of a known conserved functional domain. Alternatively, these transcripts might represent non-coding RNAs.

Fig. 2 The annotation rates of the assembled transcripts in different length distribution ranges

GO assignments were used to classify the functions of the A. nanus transcripts. Based on sequence homology, the 69,990 assembled transcripts were assigned GO terms. The annotated GO terms were classified into 50 functional groups that were distributed under three main categories: biological process, cellular component, and molecular function (Fig. 3). In the biological process category, “regulation of transcription”, “DNA-dependent”, and “proteolysis” were the major GO terms. In the cellular component category, “integral to membrane”, “nucleus”, and “plasma membrane” were the top three well-represented GO terms. Within the molecular function category, “ATP binding”, “protein serine/threonine phosphatase complex”, and “protein binding” were the top three GO terms.

Fig. 3 GO classification of the assembled transcripts

The assembled transcripts were also aligned to the COG database to predict and classify their possible functions. A total of 46,837 out of 69,990 sequences had COG functional classifications, which were grouped into 25 functional categories (Fig. 4). The top three categories were “General function prediction only” (6198 transcripts, group R), “Signal transduction mechanisms” (3956 transcripts, group T), and “Posttranslational modification, protein turnover, and chaperones” (3026 transcripts, group O). “Cell motility” (9 transcripts, group N), “Extracellular structures” (75 transcripts, group W) and “Nuclear structure” (129 transcripts, group Y) were the smallest COG categories.

Fig. 4 COG function classification of the assembled transcripts

We further performed a systematic analysis of gene function by assigning the assembled transcripts to biochemical pathways in the KEGG database. A total of 23,653 transcripts were annotated in the KEGG database (Supplementary Data 1), which were associated with 2818 KO entries, 36 KEGG level-two pathways, and 269 KEGG pathway terms (Fig. 5). Among these pathways, “Starch and sucrose metabolism” was the largest group of transcripts (647 transcripts, ko01100), followed by “Ribosome” (644 transcripts, ko03010), “Purine metabolism” (634 transcripts, ko00230), “Endocytosis” (506 transcripts, ko04144), “Glycolysis/Gluconeogenesis” (452 transcripts, ko00010), “MAPK signaling pathway” (451 transcripts, ko04010), “Ubiquitin mediated proteolysis” (432 transcripts, ko04120), and “Phenylpropanoid biosynthesis” (391 transcripts, ko00940). The top six level-two pathways were “Carbohydrate Metabolism”, “Amino Acid Metabolism”, “Signal Transduction”, “Transport and Catabolism”, “Cell Growth and Death”, and “Lipid Metabolism”.

Fig. 5 KEGG function classification of the assembled transcripts

Assembly quality assessment and sequence conservation of the A. nanus transcripts

We first assessed the quality and completeness of our assembled transcripts by comparing their sequences to a core set of plant genes using BUSCO. The result showed that, of 1440 BUSCO groups searched, 83.47% were “complete”, 5.21% were “fragmented”, and the remaining 11.32% were “missing”. Ultraconserved orthologs (UCOs, available at http://compgenomics.ucdavis.edu/compositae_reference.php) and APVO sequences represent a highly conserved set of genes and have been widely used as important indicators for gene detection and sampling breadth. The two different assessment tools were used to estimate the transcriptome coverage of our assembly using the TBLASTX algorithm. We identified all 357 (100%) UCOs from the assembled transcripts. We detected 931 (97.1%) of the 959 shared single copy tribes represented in the PlantTribes database.

To evaluate the amino acid sequence conservation of the A. nanus transcripts, we analyzed the species distribution of the assembled transcripts by aligning them against the Nr database. For the top BLAST hit species distribution of aligned transcripts in the Nr database, 47.53% were matched with sequences from G. max, followed by Cicer arietinum (15.41%), Phaseolus vulgaris (12.49%), Medicago truncatula (10.35%), and Lotus japonicus (2.78%) (Fig. 6). As expected, more than 88.57% of the distinct sequences in the assembled transcripts had top matches (first hit) with the protein sequences from plants in the family Leguminosae.

Fig. 6 Species distribution of the assembled transcripts. The species distribution of the assembled transcripts was determined by aligning against the Nr database and the top BLAST hit species were recorded and used for species distribution statistics

Prediction of CDS, transcription factor, and miRNA precursors

We utilized Genscan to identify the CDS for each assembled transcript, and 45,677 CDS (65.26% of 69,990 assembled transcripts) were predicted. Of these, 16,055 (35.15%) CDS were longer than 200 amino acids, and the 1755 longest unigenes had lengths over 1000 amino acids. All CDS were translated into peptide sequences according to the standard codon table.

Transcription factors (TFs) play important roles in gene expression regulation during plant growth, development and responses to environmental factors. We identified 3899 (8.54% of the 45,677 CDS) G. max TF analogs. The four most abundant TF families were bHLH (570 members), WRKY (513 members), MYB_related (272 members), and C3H (272 members).

Some of the assembled transcripts may represent the stem-loop precursor sequences of miRNAs. By aligning to the stem-loop precursors of A. mongolicus and G. max, a total of 71 predicted miRNA precursors were identified (Supplementary Data 2), most of which may be orthologs of miRNA genes of A. mongolicus. Among these, 24 aligned to the precursors of conserved miRNAs identified in A. mongolicus, 16 aligned to those of the non-conserved miRNAs identified with high confidence, and 31 aligned to those of the non-conserved miRNA candidates identified in A. mongolicus. Homologs of many stem-loop precursors of A. mongolicus were not identified from the assembled transcripts of A. nanus, possibly due to the spatio-temporal expression of miRNA genes.

Orthologous contigs and substitution rates between the two Ammopiptanthus species

Orthologous gene identification is a necessary step in comparative genomics analyses. We identified 29,490 pairs of putative orthologous transcripts between A. nanus and A. mongolicus. After incorporating the G. max peptide sequences, 6606 pairs of putative orthologs with full-length CDS were obtained. This reduction in ortholog number was caused mainly by the exclusion of the relatively young orthologs specific to genus Ammopiptanthus, which were discarded as having low similarity to G. max genes.

Among the 6606 pairs of orthologs between A. nanus and A. mongolicus, 430 pairs were identical, 1583 pairs had only synonymous or only nonsynonymous substitutions, and 4593 pairs had both types of substitutions, for which the Ka/Ks ratios were calculated. The mean values of Ka (the number of nonsynonymous substitutions per nonsynonymous site), Ks (the number of synonymous substitutions per synonymous site), and the Ka/Ks ratio of all orthologous pairs were 0.0065 ± 0.0078, 0.0209 ± 0.0185, and 0.4023 ± 0.3867, respectively.

The peak Ks for orthologous transcript pairs can be used to estimate the times of divergence between closely related species. According to previously described methods, the age of the speciation event between A. nanus and A. mongolicus was calculated to be approximately 0.70 ± 0.62 Mya, which falls in the later stage of middle Pleistocene.
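A common way to convert Ks into a divergence time is T = Ks / (2r), where r is the synonymous substitution rate per site per year and the factor of two accounts for substitutions accumulating along both lineages. The sketch below applies this relation to the mean Ks and its standard deviation reported above. The rate of 1.5 × 10^-8 substitutions per synonymous site per year is an assumption introduced here for illustration; the method and rate actually used are not specified in this section.

```python
def divergence_time_mya(ks: float, rate: float = 1.5e-8) -> float:
    """Divergence time in million years ago from Ks, using T = Ks / (2 * rate).

    rate is the assumed synonymous substitution rate per site per year.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    return ks / (2.0 * rate) / 1e6


if __name__ == "__main__":
    mean_ks, sd_ks = 0.0209, 0.0185  # values reported in the text
    print(round(divergence_time_mya(mean_ks), 2))  # about 0.70
    print(round(divergence_time_mya(sd_ks), 2))    # about 0.62
```

Because the estimate scales inversely with r, a different assumed rate shifts the date proportionally, which is one reason such estimates are usually treated as approximate.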

Identification of the rapidly evolved genes that might be under positive selection between the two Ammopiptanthus species

The Ka/Ks ratio in the protein-coding gene was used to estimate the selective pressure. Of the 4593 ortholog pairs, 15 pairs with a Ka/Ks value > 0.5 and a P value < 0.05 were identified, and these rapidly evolved genes were considered likely to have experienced or be experiencing positive selection (candidate positively selected genes, PSGs) (Table 4). For validation, the coding regions of five randomly selected PSGs across two species were isolated by PCR, and the mutations were confirmed by sequencing. The PSGs were aligned against Arabidopsis peptide database (TAIR10, https://www.arabidopsis.org) to reveal the associated biological processes, and the corresponding Arabidopsis homologs were determined. As shown in Table 4, the rapidly evolved genes are involved in multiple biological processes such as plant defense response, epigenetic regulation, development, intracellular transport, and protein folding.

Table 4 List of the rapidly evolved genes that might be under positive selection between the two Ammopiptanthus species

Identification and comparison of the EST-SSRs in the transcriptome sequences of A. nanus and A. mongolicus

After screening for EST-SSRs in the 69,990 assembled transcripts of A. nanus obtained in the present study and the transcriptome dataset of A. mongolicus assembled in a previous study (Gao et al. 2016), 12,064 and 15,034 SSRs distributed in 10,021 and 12,418 sequences were identified from the transcriptome sequences of A. nanus and A. mongolicus, respectively (Table 5). The EST-SSR frequency was 17.24 and 9.58%, and the distribution density was 0.12 and 0.14 per kb in the transcriptome sequences of A. nanus and A. mongolicus, respectively. Based on the repeat motifs, all SSR loci were divided into mononucleotide, dinucleotide, trinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide repeats. For both species, the most abundant repeat motif was trinucleotide, followed by dinucleotide, tetranucleotide, pentanucleotide, and hexanucleotide repeat units (Table 5). The top three classes of sequence repeat in the transcriptome sequences of A. nanus were AG/CT, AAG/CTT, and AC/GT, while the top three classes of sequence repeat in the A. mongolicus transcriptome dataset were AG/CT, AAG/CTT, and AT/AT (Fig. 7).

Table 5 Summary of the predicted EST-SSRs in A. nanus and A. mongolicus
Fig. 7 Comparison of the distribution of repeat motifs of SSRs between two Ammopiptanthus species. Grey bar, A. nanus; black bar, A. mongolicus

We further searched for SSRs in the putative orthologous transcripts, and found more than 4600 SSRs distributed among 3779 pairs of orthologous transcripts (Table 5, Supplementary Data 3). The distribution of SSR repeat motifs in orthologous transcripts was similar in the two Ammopiptanthus species (Fig. 8), indicating that many of the EST-SSRs identified in orthologous transcripts can be used for both Ammopiptanthus species. At the same time, some species-specific SSRs were also found. For example, hexanucleotide repeats were only identified in A. mongolicus; thus, the 24 hexanucleotide repeats may be species-specific SSRs (Table 5, Supplementary Data 4). In addition, we found different SSRs in the same pair of orthologous transcripts, and these SSR markers may also be species-specific (Supplementary Data 5). Although the validity of the putative species-specific SSRs is still to be confirmed by further analysis and experiments, our results provide important data that might contribute to marker development for further population-level studies of the two Ammopiptanthus species.

Fig. 8 Comparison of the distribution of repeat motifs of SSRs in the putative orthologous transcripts between two Ammopiptanthus species. Grey bar, A. nanus; black bar, A. mongolicus

Discussion

Transcriptome sequencing, de novo assembly, and annotation for A. nanus

Illumina-based transcriptome sequencing has been demonstrated to be an efficient and cost-effective method to obtain transcriptome data and identify genetic markers. In the present study, a general transcriptome dataset of A. nanus was established for the first time via Illumina next generation sequencing and assembly. The N50 length of unigenes was 1662 bp and the average length was 1013 bp. These results are comparable to those of recently published xerophyte transcriptome studies, such as for Haloxylon ammodendron (N50 = 1345 bp, average length = 728 bp) (Long et al. 2014), Reaumuria soongorica (N50 = 1109 bp, average length = 677 bp) (Shi et al. 2013), and Cynanchum komarovii (N50 = 862 bp, average length = 604 bp) (Ma et al. 2015).

The quality and completeness of a transcriptome assembly are very important for further transcriptome analysis. Assembly evaluation measures can be classified into two categories: reference-based and reference-free. Besides the two commonly used reference-free measures, the median unigene length and N50 length of contigs, three reference-based assembly evaluations were also adopted to assess the quality and completeness of our assembled transcripts. We identified all 357 (100%) UCOs and 931 (97.1%) of the 959 shared single copy tribes represented in the PlantTribes database. As an emerging tool for estimating the completeness of genome sequences, BUSCO has been used for the assessment of plant transcriptome completeness. In the present study, although less than 89% of BUSCO groups (“complete” and “fragmented”) were found in the assembled transcriptome sequences of A. nanus, the overall BUSCO percentages are greater than those of Spinacia tetrandra and Arundo donax, which were reported in two recent transcriptome studies (Evangelistella et al. 2017; Xu et al. 2015). Together, the results of the functional annotation and assessment of transcriptome coverage demonstrated that our transcriptome sequences represented a substantial portion of A. nanus protein-coding genes. Considering that A. nanus genomic information is not yet available in the public databases, the large batch of transcriptome sequences obtained in the present study not only provides a good start for the elucidation of the molecular mechanism underlying the stress tolerance of A. nanus, but also enables the comparative transcriptome analysis between A. nanus and A. mongolicus.

Divergence time between A. nanus and A. mongolicus

Plant species in genus Ammopiptanthus are relic trees of the Tertiary period. It has been hypothesized that the ancestral species of Ammopiptanthus was widely distributed from the eastern border of the Pamir Plateau to the Gobi desert during the Tertiary period (Liu et al. 1995). The aridification and formation of deserts in Central Asia from the early Miocene (24–16 Mya) resulted in the fragmentation of the continuous distribution of the ancestral Ammopiptanthus (Ge et al. 2005). Genetic differentiation between A. nanus and A. mongolicus likely occurred after a geographic barrier formed from the early Miocene. Long-term reproductive isolation led to significant genetic differences between A. nanus and A. mongolicus. In the present study, we estimated that the divergence time between A. nanus and A. mongolicus is approximately 0.70 ± 0.62 Mya, which falls in the later stage of middle Pleistocene. This estimated time range is roughly consistent with other available evidence. However, considering the disputes about the significance of the substitution rate (Lynch and Conery 2000), this divergence time is only a rough estimate based on the coding regions of orthologous genes; further evidence is still needed to determine the accurate splitting time.

Positively selected genes between A. nanus and A. mongolicus

Ka/Ks values have been widely used to identify protein-coding genes under positive or purifying selection (Hurst 2002), and Ka/Ks > 1 is widely accepted as a sign of positive selection. A Ka/Ks ratio of 0.5 has also been interpreted as a useful cut-off to identify genes under positive selection (Swanson et al. 2004). In the present study, 6606 pairs of putative orthologs with full-length CDS were obtained and 15 PSGs that were considered likely to have experienced or be experiencing positive selection were found. These orthologs were involved in multiple biological processes such as plant defense, epigenetic regulation, and plant development regulation, indicating that these biological processes were under evolutionary pressure during the speciation of the two Ammopiptanthus species. Four PSGs are involved in the plant defense response. This result is consistent with those reported in previous studies on two closely related primrose species (Zhang et al. 2013a) and two related Dipteronia species (Zhou et al. 2016). Of the four PSGs, No. 13 encodes the protein SCARECROW-like 14 (SCL14), which is a member of the GRAS family of transcription factors. Arabidopsis SCL14 is involved in the activation of many stress-responsive genes that contribute to the protection of plants against xenobiotic stress (Fode et al. 2008). PSG No. 1 is a homolog of Arabidopsis Constitutive Expresser of PR Genes 1 (CPR1). CPR1 is an F-box protein and functions as a key component in the fine control of plant immunity by mediating proteasomal degradation of SNC1, a disease resistance protein (Gou et al. 2012). PSG No. 6 encodes an enhanced downy mildew 2 (EDM2) protein, which mediates disease resistance by functioning as a direct or indirect regulator of Resistance to Peronospora parasitica 7 (RPP7) expression (Eulgem et al. 2007). PSG No. 14 encodes SRFR1, which specifically functions as a negative regulator of effector-triggered immunity in Arabidopsis (Kim et al. 2014), and mutations in SRFR1 lead to constitutive expression of SNC1 (Kim et al. 2010).

Of the 15 identified PSGs, two are reported to participate in plant epigenetic regulation. DNA methylation plays an integral role in regulating development and environmental response, and EDM2, the Arabidopsis ortholog of PSG No. 6, plays a role in regulating genomic DNA methylation patterns (Lei et al. 2014). PSG No. 11 encodes an Arabidopsis SWI3C homolog. As a component of the SWI/SNF and RSC chromatin remodeling complexes, SWI3C affects gibberellin biosynthesis and signaling, and regulates plant growth and development including leaf morphogenesis (Sarnowska et al. 2013; Vercruyssen et al. 2014).

It is noteworthy that four of the five PSGs discussed above are also involved in plant development regulation. Among them, PSG No. 1 (Arabidopsis CPR1 ortholog) might regulate leaf pavement cell development by affecting the cytoskeleton (Han et al. 2015), PSG No. 6 (Arabidopsis EDM2 ortholog) might regulate vegetative growth and development of leaf epidermal cells (Tsuchiya and Eulgem 2010), PSG No. 11 (Arabidopsis SWI3C ortholog) is likely involved in the transition from cell proliferation to cell differentiation in a developing leaf and affects leaf size (Vercruyssen et al. 2014), and PSG No. 14 (Arabidopsis SRFR1 ortholog) probably affects plant architecture via interaction with TCP transcription factors (Kim et al. 2014). These results indicate that the rapidly evolved genes may coordinate development with the environmental conditions, partially via epigenetic regulation such as DNA methylation. The other PSGs are involved in various biological processes including mRNA export from the nucleus (PSG No. 4), nucleus organization (PSG No. 2), protein targeting to mitochondrion (PSG No. 7), protein folding (PSG No. 5), and photomorphogenesis (PSG No. 3), or are associated with the peroxisome (PSG No. 10) and cytoskeleton (PSG No. 12).

Taken together, in the present study, we characterized the transcriptome of A. nanus and conducted a comparative transcriptomic analysis between A. nanus and A. mongolicus. The comparative analysis not only identified a large number of orthologous transcripts and a batch of putative positively selected genes between A. nanus and A. mongolicus and estimated the divergence time, but also advanced our understanding of the evolutionary adaptation of the two species to their individual habitats. In addition, a large number of SSRs were predicted and compared in the two Ammopiptanthus species. These results will serve as a comprehensive genomic resource for functional genomics, population genomics, and association mapping studies in the two Ammopiptanthus species in the future.

Author contribution statement

YZ, JF and FG conceived and designed the experiments. CW and ZX performed the experiments. FG and HL analyzed the data. YZ and FG wrote the paper. All authors read and approved the final manuscript.