Introduction

Lindera glauca (Sieb. et Zucc.) Bl. (Lauraceae), an economically important tree with a dioecious breeding system, is widely distributed in montane warm-temperate evergreen (WTE) forest of subtropical China, Japan and South Korea [1]. The fruits of L. glauca contain some aromatic essential oils, alkaloids, terpenoids and flavonoids, which are widely employed in the essence, medicine, and chemical industries [2]. Thus, cultivators pay increasing attention to the collection of germplasm resources for breeding research in L. glauca due to its increasing use in commercial products [3]. In addition, L. glauca is of interest to phylogeographers that aim at elucidating the evolutionary history of key components of East Asia’s extraordinarily species-rich WTE forest communities. Notably, field observations indicate that both female and male individuals of L. glauca occur in China while only female individuals are found in Japan. In contrast to their outcrossing (insect-pollinated) mainland conspecifics (i.e. in China/South Korea), these Japanese females reproduce by apomixis (asexual reproduction by seeds) [4]. In general, for colonizing species, bottlenecks associated with the occupation of new habitat can play a central role in regulating sexual-system variation [5]. Therefore, the intra-specific change in the reproductive system of L. glauca may be the result of colonization from China to Japan [4]. If so, measures of the distribution of genetic diversity in L. glauca across its range in China and Japan should provide a test of the association between the species’ reproductive mode and colonization history and elucidate how this component species of WTE forest responded to past environmental change. However, as to the choice of appropriate molecular markers, species of Lauraceae are known for their extremely conservative evolution of chloroplast DNA [6], which thus renders this molecule less ideal for elucidating the phylogeographic and population history of L. glauca. Recently, the transcriptomes of distinct tissues and different stages of the developing fruits of L. glauca were analyzed to understand fruit growth regularity and lipid accumulation [3] as well as differences in unigenes involved in terpenoid biosynthesis [7]. However, to date no study has developed molecular markers in L. glauca as important foundations for clarifying its evolutionary and phylogeographic history.

With the ongoing development of next-generation sequencing (NGS) platforms and advances in the latest bioinformatics tools, genome-level information has been generated for several model and non-model organisms. However, the sequencing of complex genomes remains expensive even using NGS technologies. RNA-sequencing (RNA-seq) represents an attractive alternative to whole-genome sequencing because it only analyzes transcriptome, a collection of expressed sequences of the genome, while avoiding non-coding and repetitive sequences that can make up much of the genome [8]. Until now, RNA-seq not only offers promising new avenues for gene annotation/discovery and a better understanding of the expression and regulation of gene networks, but also provides valuable datasets for enabling the detection of thousands of single nucleotide polymorphisms (SNPs) for further SNP marker development and mining large numbers of SSR markers in non-model organisms [8]. For instance, RNA-seq for SSR mining has been used in many plant species, such as Neolitsea sericea [9], Dysosma [10], Primulina [11]. These genome-wide markers can be used to infer demographic histories, estimating genomic variation and identifying genetic basis of adaptive divergence within and between species [11]. In this study, we used Illumina technology to obtain RNA sequence data from the female and male flower buds of L. glauca and employed the trinity method [12] for an efficient and robust de novo assembly of respective transcripts. The aims of this study were to (1) characterize the transcriptomes of the female and male flower buds of L. glauca, (2) develop EST-SSR markers and validate their levels of polymorphism across six populations of L. glauca (four from China and two from Japan).

Materials and methods

Plant material collection, RNA and DNA extraction

In April, 2015, flower buds were separately collected from one female and one male tree of L. glauca grown in Mount Tianmu (Zhejiang Province, East China), and immediately frozen in liquid nitrogen till for total RNA extraction. RNA from each gender was isolated using TRIzol Reagent (Invitrogen Life Technologies, USA) and treated with DNase (TaKaRa, Shuzo, Kyoto, Japan) at 37 °C for 30 min. In addition, silica-dried leaf material of 96 individuals from six populations of L. glauca from China (LQS, SNJ, TMS, ZJJ) and Japan (DF, QY) were used to develop EST-SSR markers (Table 1). Total genomic DNA was extracted from the dried leaf material using DNA Plantzol (Invitrogen).

Table 1 Sampling location of L. glauca from China and Japan in this study

Transcriptome sequencing, data filtering and de novo assembly

The cDNA libraries of the male and female flower buds of L. glauca were separately constructed and normalized following protocols described previously [13]. Transcriptome sequencing of the two cDNA libraries were performed on the Illumina HiSeq™ 2000 platform at Hangzhou 1gene Tech Inc., Hangzhou, China. All of the sequencing raw data were uploaded to the NCBI Sequence Read Archive (SRA, http://www.ncbi.nlm.nih.gov/traces/sra) with accession numbers SRX1519207 and SRX1519078. To obtain high-quality reads for de novo assembly, the raw reads were treated with removing adaptor sequences as well as low-quality reads with the percentage of unknown nucleotides (‘N’) higher than 5 % and that of low-quality bases (quality value lower than 20) higher than 15 %. The clean reads were assembled to construct contigs using trinity version 20131110 [12]. These contigs were subsequently clustered into unigenes with tgicl version 2.1 [14].

Functional annotation, classification and comparative analysis of unigenes

For functional annotation, all assembled unigenes were employed for homology searches against NR (NCBI non-redundant database), Swiss-Prot, COG (the Cluster of Orthologous Groups) and KEGG (the Kyoto Encyclopedia of Genes and Genomes) using blastx with an E-value of 1e−5. Based on blastx hits from the NR database, blast2go version 2.6.0 [15] was used to obtain gene ontology (GO) terms of the unigenes with an E-value of 1e−5. To view the distribution of gene functions of L. glauca, we performed GO functional classification for all unigenes at level-2 GO terms including biological processes, cellular components and molecular functions using the web gene ontology annotation plot (WEGO) [16]. The relative expression levels for unigenes were assessed by fragments per kilobase per million mapped reads (FPKM) [17]. An FPKM value with at least two-fold difference between the female and male samples and a threshold for false discovery rate (FDR) < 0.05 were used as criteria to determine putative differentially expressed genes (DEGs) [17].

SSR detection and primer design

The MicroSAtellite software (MISA, http://pgrc.ipk-gatersleben.de/misa) was used to identify SSRs from all unigenes. In order to identify perfect SSR motifs, we set the parameters of mono-, di-, tri-, tetra-, penta-, and hexa-nucleotide motifs to a minimum of 12, 6, 5, 5, 4, and 4 repeats, respectively. The primer pairs of SSRs were designed using primer3 version 2.3.4 [18] with the following criteria: (1) primer length varying from 18 to 28 bp with an optimum of 23 bp, (2) annealing temperature between 55 and 65 °C with an optimum of 60 °C, and (3) sizes of PCR products ranging from 80 to 160 bp.

EST-SSR development and population genetic analyses

The EST-SSR markers were initially tested for PCR amplification among 12 individuals of L. glauca (i.e. two per population). The PCR amplification reactions were conducted in a 20 μl reaction volume containing 10 μl 2× master mix (TsingKe, Beijing, China), 8 μl of ddH2O, 0.5 μl of 10 μΜ forward and reverse primer, and 1 μl of genomic DNA (50 ng). PCR amplifications were performed using a thermal cycler GeneAmp PCR System 9700 (Applied Biosystems, Foster City, CA, USA) with a procedure as follows: (1) denaturation at 95 °C for 5 min, (2) 30 cycles at 95 °C for 30 s, annealing temperature for 45 s, 72 °C for 30 s, and (3) final extension at 72 °C for 10 min, followed by storage at 4 °C. The successful PCR products were tested for product size and polymorphism using 12 % non-denaturing polyacrylamide gels with a 100 bp marker (TaKaRa, Dalian, Liaoning, China) as reference and visualized by silver staining. For the full population genetic survey, the polymorphic primer pairs were synthesized again with fluorescent dyes (TAMRA, HEX, or 6-FAM) at the 5′ end of the forward primer and were used to amplify DNA of all six L. glauca populations (96 individuals in total). Fragments of PCR product were separated on a 3730xl DNA Analyzer (Applied Biosystems) and then the alleles were manually scored and determined using genemarker version 2.2.0 (SoftGenetics, PA, USA) with GeneScan 500 LIZ as size standard.

To evaluate the polymorphism of each EST-SSR locus, we calculated the following parameters using cervus version 3.0.3 [19]: the number of observed alleles (N A), observed vs. expected heterozygosity (H O vs. H E), and the polymorphism information content (PIC). The significance of departures from Hardy–Weinberg equilibrium (HWE) as the result of heterozygote deficit/excess was tested in genepop version 4.0.7 based on the deviations of F IS values from zero [20]. We also estimated the frequencies of null alleles using freena according to the expectation maximization (EM) method [21].

At the population levels, the following genetic parameters were estimated across all loci using fstat version 2.9.3 [22]: total number of detected alleles (N A), allelic richness (R S), expected heterozygosity (H E), and the average inbreeding coefficient (F IS). Based on the entire EST-SSR dataset, population genetic structure was inferred by Bayesian clustering in structure version 2.3 [23]. The number of clusters (K) was set to vary from one to six, and for each K, 10 independent runs was performed with 50,000 Markov Chain Monte Carlo (MCMC) replications and 50,000 burn-in. To explore the optimum number of clusters (K), two alternative methods, lnP(D) [21] and ΔK [24] were considered. Using the program poptree version 2 [25], a UPGMA tree was constructed to reveal relationships among populations based on Nei et al. [26] D A distances among populations, with bootstrap values obtained from 10,000 resampling replicates across loci.

Results and discussion

De novo assembly of the female and male bud transcriptomes of L. glauca

In this study, we obtained 48,998,398 clean reads from the female flower bud transcriptome. The GC percentage was 49.33 %, and the Q20 percentage (percentage of sequences with sequencing error rate lower than 1 %) was 97.38 %. From the male bud transcriptome, 51,783,474 clean reads were obtained, with GC percentage of 48.97 % and Q20 percentage of 97.64 %.

Through de novo assembly, the clean reads for the female were assembled into 123,154 contigs with an average length of 370 bp and an N50 length (the contig size such that 50 % of the entire assembly is contained in contigs equal to or longer than this value) of 643 bp, and for the male into 163,380 contigs (mean length: 347 bp; N50 length: 525 bp). All contigs were then respectively clustered into 59,753 unigenes for the female (mean length: 789 bp; N50 length: 1240 bp), and into 75,075 unigenes for the male (mean length: 748 bp; N50 length: 1228 bp). The length distributions of the contigs and the unigenes are shown in Fig. S1. Notably, the average lengths of the unigenes for both the female and male buds of L. glauca were longer than those recently published for other tissues of the same or other species, including 560 bp of L. glauca, 584 bp of Cinnamomum camphora and 531 bp of Dysosma aurantiocaulis [7, 27, 28].

Functional annotation, classification and comparative analysis of unigenes

The functional annotation of all unigenes was performed by searching against public protein databases, including NR, Swiss-Prot, COG, KEGG and GO, using blastx with an E-value of 1e−5. Based on sequence similarity, 44,379 (74.27 %) unigenes for the female and 45,414 (60.49 %) unigenes for the male showed matches in at least one database. The numbers of unigenes annotated with each database are summarized in Table S1. Using the blast2go program, we were able to assign at least one GO functional term to 30,119 (67.87 %) and 29,521 (65 %) unigenes of the female and male, respectively. These annotated unigenes were then classified into three main GO categories: biological process [female: 95,901 (46.16 %); male: 93,753 (46.47 %)], cellular component [77,239 (37.18 %) vs. 73,509 (36.43 %)], and molecular function [34,622 (16.66 %) vs. 34,495 (17.10 %)], which were then subdivided into subcategories (Fig. 1). However, a fraction of unigenes from both the female (25.73 %) and the male (39.51 %) could not be aligned to the above databases, suggesting that these sequences may be fast evolving genes, novel proteins, or untranslated regions (UTRs) in L. glauca [28]. In order to compare genes expression in female and male buds, unigenes were quantified as FPKM. A total of 48,542 unigens showed detectable expression with FPKM > 1. Among them, 3718 and 7444 had specific expression in female and male buds, respectively, while 11,127 were found to be differentially expressed in two sex-type flowers.

Fig. 1
figure 1

Distribution of GO terms in the female and male buds of L. glauca

Identification of EST-SSRs from the L. glauca transcriptome

EST-SSRs are proven to be important molecular markers for assessing genetic diversity, structure, and demography in natural plant populations but are also of relevance for applied and experimental research, including the development of genetic maps for marker-assisted selection [29]. However, there has been no study of EST-SSR development in L. glauca. In this study, we identified a total of 20,048 EST-SSRs across all unigenes derived from the transcriptomes of both female and male buds. Specifically, EST-SSRs containing di-nucleotide repeats were most frequent (8520, 42.50 %), followed by those with tri-nucleotide repeats (7172, 35.77 %) and mono-nucleotide repeats (3256, 16.24 %). The distribution of SSR repeat motifs is summarized in Table S2. As shown in Fig. S2, in the mono-nucleotide repeats, the A/T motif was much more common than C/G; in the di-nucleotide repeats, AG/CT was much more prevalent than AC/GT, AT/AT or CG/CG; and in the tri-nucleotide repeats, AAG/CTT was the most common motif, followed by AGG/CCT, AGC/CTG and ATC/ATG. Overall, these results support previous observations that di-nucleotide AG/CT repeats and tri-nucleotide AAG/CTT repeats are the most abundant SSR motifs in dicots [9, 10].

Development and validation of EST-SSR markers

Using primer3, primer pairs were designed successfully for 3147 EST-SSR loci, while for the remaining loci this could not be achieved because of inappropriate or too short flanking sequences [10]. Based on this dataset, 120 primer pairs (Table S3) were randomly selected and synthesized for testing polymorphism. Of those, 91 primer pairs (75.8 %) successfully produced PCR products of the expected size, while the remaining 29 primer pairs (24.2 %) failed to amplify or produced weak bands, even after adjustments to the PCR conditions. Thirteen of the 91 primer pairs could reliably amplify polymorphic bands in the initial screening of 12 individuals, whereas 49 (40.8 %) and 29 (24.2 %) showed monomorphism and nonspecific amplification, respectively. These 13 polymorphic loci were synthesized again and then used to evaluate the levels of polymorphism in all 96 L. glauca individuals (Table 2). Across the 13 loci, a total of 58 alleles were observed and the number of alleles (N A) per locus varied from three to six with an average of 4.42 alleles. The observed (H O) and expected (H E) heterozygosity varied from 0.42 to 0.92 and from 0.37 to 0.72, respectively. The PIC values varied from 0.34 to 0.67, with a mean of 0.49. Nine out of the 13 loci displayed significant deviation from HWE after Bonferroni correction (P < 0.0038). Since no null alleles were detected at these loci, other causes for these deviations could be due either to inappropriate sampling—population admixture Wahlund effect or biological (inbreeding, negative/positive assortative mating, associative overdominance, small reproductive population size, asexual reproduction) causes [30]. Overall, the developed EST-SSR markers showed sufficient levels of polymorphism across loci to allow first insights into the population genetic diversity and structure of L. glauca.

Table 2 Thirteen EST-SSR markers developed in L. glauca, characteristics of polymorphism, and the results of Hardy–Weinberg equilibrium testing

Population genetic diversity and structure

During the investigation of the six populations of L. glauca, the total number of detected alleles (N A) per population varied from 27 (QY) to 52 (TMS) with a mean of 32.63; allelic richness (R S) varied from 2.08 (QY) to 3.38 (TMS) with a mean of 2.38; and values of expected heterozygosity (H E) ranged from 0.36 (LQS) to 0.54 (TMS) with a mean of 0.45 (Table 3). At the regional level, the mean estimates of within-population diversity (e.g. R S, H E) were similar in China (R S = 2.61, H E = 0.44) and Japan (R S = 2.15, H E = 0.45). By contrast, almost all populations of L. glauca (excepting one in China: TMS) displayed negative F IS values (Table 3), suggesting that heterozygosity excess in these populations led to significant departures from HWE at most loci. Recalling that Japanese L. glauca is solely composed of asexual (apomictic) females [4], the above results are consistent with those of previous studies, indicating that asexual populations usually display negative F IS values [30], and harbor less allelic diversity than their sexual counterparts but high heterozygosity at most loci [31, 32]. The latter is often explained by long-term asexuality that maintains heterozygosity or even increases it by mutation over generations [30] or hybrid origins of asexual lineages [33]. For Japanese L. glauca, however, the present data do not allow distinguishing among these alternative hypotheses.

Table 3 Population genetic diversity, inbreeding parameters, and P value of Hardy–Weinberg equilibrium of 6 populations of L. glauca from China and Japan based on EST-SSR markers

In the structure analysis, it was difficult to determine the optimal number of K clusters following Pritchard et al. [23] because the posterior probability of the data [lnP(D)] increased progressively as K increased. Hence, we considered the most appropriate cluster number following the △K statistic of Evanno et al. [24], which showed the highest value at K = 3 (Fig. S3). Accordingly, Chinese individuals from populations SNJ and LQS were mostly assigned to separate clusters (red vs. green in Fig. 2a, b), while those from ZJJ and TMS showed admixture for both gene pools plus the one (blue) fixed in all populations from Japan (DF, QY). However, in the UPGMA dendrogram (Fig. 2c), Chinese and Japanese populations formed distinct clusters (each with 100 % bootstrap support). This latter pattern is typical for Chinese-Japanese plant disjunction that have long been separated across the East China Sea, either as a result of landbridge submergence (viz. vicariance) during the last inter-/postglacial periods, or following over-sea seed dispersal [34]. In either event, the blue gene pool fixed in Japanese L. glauca most likely signifies a historical population bottleneck of early immigrants. Based on current evidence [4], it seems most likely that these immigrants were apomictic females on arrival rather than dioecious sexuals that evolved apomixis after arrival, due to genetic drift and/or selection pressures imposed by the special ecological conditions of the Japanese Archipelago [35]. However, testing these hypotheses requires further sampling and genetic, reproductive and ecological analyses of L. glauca across its entire range in China, Japan and South Korea.

Fig. 2
figure 2

a Geographic distribution of six populations of L. glauca from China (SNJ, ZJJ, LQS, TMS) and Japan (DF, QY). The colors correspond to the three genetic clusters (K) identified by structure across these six populations (96 individuals in total) based on their variation at 13 EST-SSR loci. b Barplot displaying the assignment of individuals to the three structure clusters, whereby each individual is represented by a vertical bar. c UPGMA dendrogram of genetic relationships among the six populations of L. glauca based on Nei et al. [26] D A distances. Numbers on the branches represent bootstrap values as percentage of 10,000 resampling replicates across loci

Conclusions

In this work, we performed de novo transcriptome sequencing of the female and male flower buds of L. glauca using the Illumina sequencing platform. We generated a large ESTs collection and identified a collection of candidate genes that differentially expressed in the male and female flower buds of L. glauca. Furthermore, we identified a total of 20,048 EST-SSRs. Marker validation assay revealed that 13 (11 %) of the 120 EST-SSRs exhibited expected polymorphism profiles. These data offer valuable molecular resources for future studies on functional genomics focused on the variable reproductive system of L. glauca (dioecy, apomixis) as well as population genetics of L. glauca and its intriguing evolutionary shift from dioecy to apomixis in the wake of island colonization.