Introduction

Neottopteris nidus L., a member of the Aspleniaceae family native to tropical Asia, is one of the most economically important foliage plants (Ellwood and Foster 2004). N. nidus is an epiphytic fern with erect, simple, wavy, bright green leaves (Ozanne et al. 2003). It is widely cultivated in temperate regions as a houseplant and can help clean the air naturally (Fernandez and Revilla 2003). Owing to its distinctive glossy leaves, it has become important in the ornamental industry and in landscaping, especially in areas where it is uncommon. In addition, N. nidus supports a high abundance and wide diversity of taxa and plays an important ecological role in structuring rain forest arthropod communities (Karasawa and Hijii 2006; Ellwood et al. 2009).

In recent years, next-generation sequencing (NGS) technology has developed rapidly for life sciences research. It provides cost-effective, high-throughput, and comprehensive analyses for model organisms and also makes it possible to analyze non-model organisms whose genomes have not been sequenced (Wang et al. 2009; González-Ballester et al. 2010; Li et al. 2010). Transcriptome sequencing is an efficient way to generate functional genomic-level data for non-model plants such as N. nidus, and large numbers of expressed sequence tag (EST) sequences are valuable for gene annotation and discovery, expression profiling, comparative genomics, and development of molecular markers (Emrich et al. 2006; Feng et al. 2012; Xiao et al. 2013; Wu et al. 2015; Zhang et al. 2015). Traditional sequencing technology has made significant contributions to genomics research and cDNA library construction, but it is costly, time-consuming, and sensitive to cloning biases (Wei et al. 2011). NGS technology provides abundant resources for novel gene discovery, molecular marker development, and gene mapping. An increasing number of EST sequences have recently become available for model and non-model plants, but very few are available for N. nidus. At present, only 76 nucleotide sequences of N. nidus have been deposited in the National Center for Biotechnology Information (NCBI) nucleotide database (as of November 2015).

Simple sequence repeats (SSRs) are 1–6 bp long repeat sequences that occur ubiquitously in plant genomes and serve as important molecular tools for genetic and genomic research (Gur-Arie et al. 2000). SSR markers are useful for a variety of applications in plant genetics and breeding because of their genetic codominance, abundance, multi-allelic variation, high reproducibility, and high level of polymorphism (Powell et al. 1996; Aggarwal et al. 2007). SSRs can be divided into genomic SSRs and EST-SSRs on the basis of the original sequences used to identify simple repeats. However, traditional methods to isolate and identify genomic SSRs are labor-intensive, costly, and time-consuming (Zane et al. 2002). EST-SSRs designed from expressed sequence tag data are more evolutionarily conserved in comparison to markers that are generated from genomic sequences and show more transferability between species (Varshney et al. 2005; Portis et al. 2007). With the increasing number of EST sequences deposited in public databases, a large number of EST-SSRs have been developed, and the polymorphism of EST-SSRs has been evaluated in many plant species (Triwitayakorn et al. 2011; Asadi and Monfared 2014; Kumar et al. 2014; Jia et al. 2015; Chen et al. 2015).

In the present study, we used the Illumina HiSeq™ 2000 platform to characterize the N. nidus transcriptome and to develop a set of EST-SSRs. To our knowledge, this study is the first to characterize the transcriptome of N. nidus from large-scale transcript sequences generated with the Illumina paired-end sequencing strategy. These EST data sets will serve as a useful sequence resource for novel gene discovery and molecular marker development in N. nidus.

Materials and methods

Plant materials and RNA extraction

Neottopteris nidus was grown in a growth chamber at 25/18 °C (day/night), 60 % relative humidity, and a photoperiod of 16 h light/8 h dark at Jiangsu Academy of Agricultural Sciences, Nanjing, China. Young roots, shoots, and leaves were collected, immediately frozen in liquid nitrogen, and stored at −70 °C until RNA extraction. For Illumina sequencing, total RNA was extracted using Trizol reagent (Invitrogen, Carlsbad, CA, USA) according to the manufacturer’s instructions. The extracted RNA was treated with DNase I (Takara Biotechnology, China) for 45 min at 37 °C to remove residual DNA. RNA integrity was verified using the 2100 Bioanalyzer (Agilent Technologies, Santa Clara, USA), with a minimum RNA integrity number (RIN) of 8. Subsequently, equal amounts of total RNA from the three N. nidus tissues were pooled to prepare a cDNA library.

cDNA library construction and Illumina sequencing

The cDNA library was constructed following the Illumina manufacturer’s instructions. First, total RNA was treated with DNase I, and poly-A RNA was isolated from total RNA using oligo (dT) magnetic beads. Fragmentation buffer was added to break the mRNA into short fragments. Using these short fragments as templates, first-strand cDNA was synthesized with random hexamer primers. The second-strand cDNA was synthesized using buffer, dNTPs, RNase H, and DNA polymerase I (Invitrogen, Grand Island, NY, USA). The resulting short fragments were purified using the PCR extraction kit and suspended in EB buffer for end repair and poly (A) addition. The short fragments were then ligated to sequencing adaptors at both ends to form a tag library. After PCR amplification, suitable fragments were purified by agarose gel electrophoresis. Finally, the cDNA library was sequenced on the Illumina HiSeq™ 2000 platform.

De novo assembly of sequencing data

Before transcriptome assembly, the raw sequencing reads were stringently filtered. Clean reads were obtained from the raw data by removing adaptor-only reads, low-quality reads, and reads in which more than 10 % of bases had Q < 20. All clean reads were then used for de novo transcriptome assembly with the short-read assembler Trinity under default settings (Grabherr et al. 2011). Trinity first combines reads with an identity of 95 % and a coverage length of 100 bp to form longer fragments without N, called contigs. Next, Trinity connects the contigs, using N to represent unknown sequence between each pair of contigs, to form transcripts. Finally, paired-end reads were used to fill the gaps between different transcripts in order to obtain unigenes. The short-read data sets were deposited in the NCBI Short Read Archive (SRA) under accession number SRX269217.
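The read-level quality criterion described above (discarding reads in which more than 10 % of bases have Q < 20) can be illustrated with a minimal Python sketch. It is not the filtering software used in this study; the function names are hypothetical, and the Phred+33 quality encoding is an assumption that should be checked against the actual FASTQ files.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to Phred scores.

    The offset of 33 (Sanger / Illumina 1.8+ encoding) is an assumption;
    older Illumina pipelines used an offset of 64.
    """
    return [ord(ch) - offset for ch in quality_string]


def passes_quality_filter(quality_string, min_q=20, max_low_fraction=0.10, offset=33):
    """Return True if no more than `max_low_fraction` of bases have Q < `min_q`."""
    scores = phred_scores(quality_string, offset)
    if not scores:
        return False  # an empty read carries no usable sequence
    low_quality = sum(1 for q in scores if q < min_q)
    return low_quality / len(scores) <= max_low_fraction


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
keep = passes_quality_filter("#" + "I" * 9)      # exactly 10 % low-quality bases
drop = passes_quality_filter("##" + "I" * 8)     # 20 % low-quality bases
```

A read is kept when exactly 10 % of its bases fall below Q20, because the criterion removes only reads with more than 10 % such bases.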

Gene annotation and analysis

All unigenes were searched for homology against protein databases, including the NR protein database (http://www.ncbi.nlm.nih.gov), Swiss-Prot protein database (http://www.expasy.ch/sprot), GO database (http://www.geneontology.org), COG database (http://www.ncbi.nlm.nih.gov/cog), and KEGG pathway database (http://www.genome.jp/kegg), using BLAST alignment (E value ≤1E−5). The best alignment results were used to determine the sequence direction of the unigenes. From the best BLAST hits, putative gene names and predicted proteins were assigned to the corresponding assembled sequences. Functional annotation of all unigenes with gene ontology (GO) terms was performed on the basis of the best BLAST hits using the Blast2GO software. WEGO was used for GO functional classification of unigenes and to view the distribution of gene functions at the macro level (Ye et al. 2006). For pathway-enrichment analysis, unigenes were mapped to terms in the KEGG database (Kanehisa et al. 2008).
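As an illustration of how best hits can be selected under the E value cutoff, the following Python sketch keeps the highest-scoring alignment per unigene. It assumes BLAST results in the standard 12-column tabular layout (query, subject, identity, length, mismatches, gap openings, query start/end, subject start/end, E value, bit score); the output format actually used in this study is not stated, and the function name is hypothetical.

```python
import csv
import io


def best_hits(tabular_text, max_evalue=1e-5):
    """Return {query: (subject, evalue, bitscore)} for the best hit per query.

    Assumes 12-column tab-separated BLAST output; this layout is an
    assumption, not a detail reported in the text.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        query, subject = row[0], row[1]
        evalue, bitscore = float(row[10]), float(row[11])
        if evalue > max_evalue:
            continue  # alignment does not meet the E value cutoff
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best
```

Ranking by bit score is one common convention; ranking by E value would be an equally reasonable choice for this sketch.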

SSR identification and primer design

To further evaluate the assembly quality and develop molecular markers, all unigenes were used to mine potential SSRs with the Simple Sequence Repeat Identification Tool (SSRIT, http://www.gramene.org/db/markers/ssrtool). The parameters were adjusted for identification of perfect di-, tri-, tetra-, penta-, and hexanucleotide motifs with a minimum of 6, 5, 5, 4, and 4 repeats, respectively. Mononucleotide repeats were ignored since distinguishing genuine mononucleotide repeats from polyadenylation products and single nucleotide stretch errors generated by sequencing was difficult. Primer pairs were designed for the flanking regions of SSRs using the BatchPrimer3 software. The primer design parameters were set as follows: primer length of 18–23 bases, GC content of 40–60 %, PCR product size of 150–500 bp, and annealing temperatures of 50–60 °C. In total, 100 pairs of primers were designed and validated by PCR reactions. PCR amplifications using DNA templates extracted from the ten fern germplasms were performed in 10 µl reaction mixtures containing template DNA 30 ng, 10× PCR buffer 1.0 µl, dNTPs (10 mM) 0.8 µl, primer (10 µM) 0.8 µl, and Taq DNA polymerase 0.5 U. The PCR reaction cycling profile was 94 °C for 5 min, followed by 33 cycles of 94 °C for 40 s, 50–60 °C for 50 s, 72 °C for 1 min, and a final step of 72 °C for 10 min. PCR products were analyzed by electrophoresis on 8 % non-denaturing polyacrylamide gels and visualized by silver staining.
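The repeat thresholds above can be expressed compactly with regular expressions. The sketch below is an illustrative stand-in for SSRIT, not its implementation: it reports perfect di- to hexanucleotide repeats with at least 6, 5, 5, 4, and 4 units, respectively, and skips motifs that are themselves repeats of a shorter unit (so mononucleotide runs are ignored). The function and variable names are hypothetical.

```python
import re

# Minimum number of repeat units per motif length, as stated in the text.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 4, 6: 4}


def find_ssrs(sequence):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    sequence = sequence.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # One captured unit followed by at least (min_rep - 1) exact copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(sequence):
            motif = match.group(1)
            # Skip motifs built from a shorter unit (e.g. AA, CTCT, ACCACC).
            is_compound = any(
                motif == motif[:k] * (unit_len // k)
                for k in range(1, unit_len)
                if unit_len % k == 0
            )
            if is_compound:
                continue
            hits.append((match.start(), motif, len(match.group(0)) // unit_len))
    return sorted(hits)


example = find_ssrs("GG" + "CT" * 6 + "AA")
```

Here `example` contains a single CT repeat of six units starting at position 2 (0-based); a run of only five CT units would fall below the dinucleotide threshold and would not be reported.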

Results

Illumina paired-end sequencing and de novo assembly

To characterize the transcriptome of N. nidus, we constructed a cDNA library from pooled RNA samples and sequenced it on the Illumina platform. After a stringent quality check and data filtering, 53,313,049 paired-end reads totaling 10.66 Gbp were generated. Among the clean reads, more than 87 % had Phred-like quality scores at the Q20 level. The assembly yielded 2,104,392 contigs with an N50 length of 316 bp and an average length of 223 bp, 63,529 transcripts (>200 bp) with an N50 length of 1895 bp and an average length of 1139 bp, and 41,173 unigenes (>200 bp) with an N50 length of 1737 bp and an average length of 969 bp (Supplementary Table S1). The length distributions of the contigs, transcripts, and unigenes are shown in Fig. 1. The unigenes ranged from 201 to 14,471 bp in length. Of the 41,173 unigenes, 19,953 (48.46 %) were between 201 and 500 bp; 7793 (18.93 %) ranged from 501 to 1000 bp; 7815 (18.98 %) ranged from 1001 to 2000 bp; and 5612 (13.63 %) were longer than 2000 bp.
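For readers unfamiliar with the N50 statistic reported above, the following Python sketch shows one common way to compute it from a list of sequence lengths, and re-derives the length-class percentages from the unigene counts given in the text. The function names are hypothetical, and the sketch is independent of the assembly software used in this study.

```python
def n50(lengths):
    """Return the length L such that sequences of length >= L
    contain at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0  # empty input


def class_percentages(counts):
    """Convert per-class counts into percentages rounded to two decimals."""
    total = sum(counts)
    return [round(100 * c / total, 2) for c in counts]


# Unigene counts per length class as reported in the text:
# 201-500 bp, 501-1000 bp, 1001-2000 bp, >2000 bp.
unigene_classes = [19953, 7793, 7815, 5612]
percentages = class_percentages(unigene_classes)
```

The four counts sum to 41,173, and the resulting percentages agree with the values reported for each length class.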

Fig. 1 Overview of the N. nidus transcriptome assembly. a Size distribution of contigs, b size distribution of transcripts, and c size distribution of unigenes

Functional annotation of assembled unigenes

To validate and annotate the assembled unigenes, all unigenes were searched against the NCBI non-redundant (NR) protein database and the Swiss-Prot protein database. Of the 41,173 unigenes, 24,657 (59.89 %) could be annotated based on sequences in the NR protein database, and 17,436 (42.35 %) had significant similarity to annotated proteins in the Swiss-Prot protein database (Supplementary Table S2). Altogether, 24,801 unigenes (60.23 %) were successfully annotated in the NR and Swiss-Prot protein databases, suggesting that this Illumina sequencing project captured a substantial fraction of N. nidus genes. The proportion of unigenes with homologous matches decreased with decreasing unigene length. About 95.37 % of the assembled unigenes over 1000 bp in length showed homologous matches in the NR protein database, whereas only 29.34 % of those shorter than 300 bp did (Fig. 2). The E value distribution of the top matches showed that 50.55 % of the aligned sequences had significant homology in the NR protein database (E value <1E−50), and 33.80 % of the aligned sequences had alignment identities greater than 60 % (Supplementary Fig. S1A, C). As expected, a comparable pattern of E value and similarity distributions of the top matches was found in the Swiss-Prot protein database: 42.70 % of the aligned sequences had significant homology, and 26.89 % had alignment identities higher than 60 % (Supplementary Fig. S1B, D).

Fig. 2 Comparison of unigene length with or without hits. Longer unigenes were more likely to have BLAST hits in protein databases

Functional classification by GO, COG, and KEGG

Based on the NR annotation, 15,683 unigenes with BLAST matches to known proteins were categorized into 55 functional groups within the GO database, which fall into three main categories: biological process, cellular component, and molecular function (Supplementary Fig. S2). Within the cellular component category, cell part (9942, 24.80 %) was the most dominant group, followed by cell (9699, 24.19 %) and organelle (8421, 21.01 %). Within the molecular function category, catalytic activity (7626, 45.58 %) and binding (6716, 40.13 %) were the first and second largest groups, respectively, whereas other groups such as metallochaperone activity, protein tag, receptor activity, nutrient reservoir activity, and antioxidant activity together contained only 223 unigenes (1.32 %). Within the biological process category, metabolic process (10,132, 22.66 %) and cellular process (9543, 21.35 %) were the dominant groups, followed by response to stimulus (4407, 9.86 %), biological regulation (3732, 8.35 %), and cellular component organization or biogenesis (2749, 6.15 %).

To further evaluate the integrity of the N. nidus transcriptome and the effectiveness of the annotation process, unigene sequences were searched against the Clusters of Orthologous Groups (COG) database for functional prediction and classification. Of the 24,657 unigenes with significant similarity to the NR protein database, 7942 were assigned to COG classifications (Supplementary Fig. S3). COG-annotated putative proteins were functionally classified into 25 categories. Among these, the cluster for general function prediction only (2175, 19.39 %) was the largest group, followed by replication, recombination and repair (1097, 9.78 %), transcription (1018, 9.08 %), signal transduction mechanisms (894, 7.97 %), and translation, ribosomal structure, and biogenesis (730, 6.51 %), whereas five groups each accounted for less than 1.00 %: RNA processing and modification, chromatin structure and dynamics, cell motility, extracellular structures, and nuclear structure.

To identify biological pathways active in N. nidus, Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis was conducted with an E value cutoff of 1E−5. According to the KEGG database, 5548 unigenes were assigned to five main categories comprising 116 KEGG functional pathways (Supplementary Table S3). Among the five categories, metabolism was the largest (3263, 58.81 %), followed by genetic information processing (1629, 29.36 %), cellular processes (312, 5.62 %), environmental information processing (204, 3.68 %), and organismal systems (140, 2.53 %). The specific pathways, including ribosome, RNA transport, spliceosome, purine metabolism, plant hormone signal transduction, and the other 20 top mapped pathways, are shown in Supplementary Fig. S4. In addition, the KEGG metabolic pathways contained many amino acid synthesis and degradation pathways, among which arginine and proline metabolism, cysteine and methionine metabolism, phenylalanine metabolism, and alanine, aspartate, and glutamate metabolism were the most represented. The KEGG functional classification provides a valuable resource for investigating specific processes, functions, and pathways in the transcriptome of N. nidus.

The most abundant transcripts in the N. nidus transcriptome

Of the 41,173 unigenes, 23 contained more than 50 reads and represented the most abundant transcripts in the N. nidus transcriptome (Table 1). Two transcripts encoding chlorophyll a/b-binding protein and phototropin, which are known to be involved in photomorphogenesis, were predominantly expressed. Four other abundant transcripts encode cytochrome P450, serine/threonine-protein kinase, E3-ubiquitin protein ligase, and non-specific lipid-transfer protein, which are required for plant defense responses. Besides these stress response proteins, two transcripts encoding phenylalanine ammonia lyase and chalcone synthase, which are associated with flavonoid biosynthesis and perform a variety of functions in plants, were predominantly expressed. Moreover, the abundance of transcripts encoding DNA-binding protein, ribosomal protein, amino acid transporter, expansin, ATP-binding protein, zinc finger protein, and RNA-binding protein, which are considered to play roles in amino acid metabolism, energy metabolism, cell proliferation, development, and growth in plants, suggests that these genes might be very important in N. nidus growth and development. Some highly abundant transcripts matched unknown or predicted proteins. These genes may play important roles in N. nidus growth, development, defense responses, and secondary metabolite production and merit further functional investigation.

Table 1 The most abundant transcripts in the N. nidus transcriptome

Development and characterization of EST-SSR markers

To further evaluate the assembly quality and develop new molecular markers, the 41,173 unigenes assembled were used to mine potential microsatellites that were defined as di- to hexanucleotide motifs. Using the SSRIT tool, a total of 5792 potential EST-SSRs were identified from 4823 unigenes (Supplementary Table S4). Among the 4823 unigenes, 3665 and 1158 unigenes contained one and more than one SSR, respectively. For these unigenes, the EST-SSR frequency was 11.71 %, and an average of one SSR was found every 4.89 kb in the unigenes.

The frequency, type, and distribution of the potential EST-SSRs were also analyzed. Dinucleotide repeats were the most abundant type (3836, 66.23 %), followed by trinucleotide (1575, 27.19 %), tetranucleotide (207, 3.57 %), hexanucleotide (116, 2.00 %), and pentanucleotide (58, 1.01 %) repeats (Table 2). The frequencies of EST-SSRs with different numbers of tandem repeats were further evaluated. As shown in Table 2, EST-SSRs with six tandem repeats (1481, 25.57 %) were the most common, followed by those with seven (1003, 17.32 %), five (981, 16.94 %), eight (739, 12.78 %), and nine (576, 9.93 %) tandem repeats. Most EST-SSRs were 15 to 24 bp long, accounting for 86.54 % of the total. The dominant repeat motif in EST-SSRs was CT/GA (1229, 21.22 %), followed by AG/TC (905, 15.63 %), CA/GT (504, 11.32 %), and AC/TG (339, 8.70 %) (Fig. 3). These four motif types together represented about 56.87 % of the EST-SSRs, whereas the remaining motif types accounted for 43.13 %.

Table 2 The distribution of EST-SSRs based on the number of repeat units

Fig. 3 Frequency distribution of EST-SSRs based on motif sequence types. The frequencies of the main motif types are shown

Using the EST-SSR-containing sequences as a source, a total of 100 potential marker sites were randomly selected to validate the assembly quality and develop EST-SSR markers. Of the 100 primer pairs, 79 successfully amplified PCR products from N. nidus genomic DNA (Supplementary Table S5). The remaining 21 primer pairs failed to generate PCR products at various annealing temperatures and Mg2+ concentrations and were therefore excluded from further analysis. Among the 79 successful primer pairs, 65 generated PCR products of the expected size, 10 generated products larger than expected, and 4 generated products smaller than expected. A total of 306 amplified bands were detected with the 79 primer pairs, and the number of bands per primer pair ranged from one to eight, with an average of 3.87 per marker. The 79 primer pairs were further examined with 10 fern germplasms as PCR templates; 68 amplified polymorphic products (alleles of different sizes) and 11 amplified monomorphic products. Supplementary Figure S5 shows the polymorphic bands amplified by three primer pairs. The polymorphic EST-SSR markers are important for research on genetic diversity, cultivar identification, and marker-assisted selection breeding in N. nidus. These results demonstrate that developing SSR markers from assembled unigenes derived from Illumina paired-end sequencing is an effective and feasible approach in N. nidus.

Discussion

Illumina sequencing technology is an extremely high-throughput and effective method for obtaining large amounts of transcriptome data, which are essential for identifying novel genes and developing molecular markers. Recent algorithmic and experimental advances are likely to increase the applicability of Illumina sequencing and de novo assembly, which have been successfully used in model and non-model plants (Libault et al. 2010; Xie et al. 2012; Verma et al. 2013; Jia et al. 2015). Consistent with these reports, our results suggest that short reads from Illumina sequencing can be effectively assembled and used for gene identification and EST-SSR marker development in non-model plants. About 10.66 Gbp of data were generated and assembled into a reference transcriptome, the first to be reported for N. nidus. The average length of all unigenes was 969 bp, which is longer than that documented in previous studies of tea (402 bp), sweet potato (581 bp), and sesame (629 bp) (Wang et al. 2010; Wei et al. 2011; Tan et al. 2013). There are two possible reasons for the longer unigenes in our study. First, the paired-end reads obtained here (100 bp) were longer than those in previous studies (75 or 90 bp). Second, we used the assembly program Trinity, which can recover more full-length transcripts across a broad range of expression levels and provides a unified solution for transcriptome reconstruction in species without a reference genome (Grabherr et al. 2011). These results suggest that the transcriptome data of N. nidus were effectively assembled, a conclusion further supported by the high proportion of unigenes with matches in public protein databases.

The assembly can be further validated by comparison with protein databases. In this study, 60.23 % of the unigenes had homologs in the NCBI NR or Swiss-Prot protein databases, suggesting relatively conserved functions. The length of an assembled unigene is crucial in determining the level of significance of a BLAST match (Shi et al. 2011). We found that the proportion of assembled unigenes with BLAST matches decreased with decreasing unigene length. Close to 40 % of the N. nidus unigenes had no BLAST matches in protein databases, which might be due to the relatively short length or poor alignment of the assembled unigenes. Shorter sequences may lack conserved functional domains, or they may contain a known protein domain but fail to show sequence matches because the query sequence is short, resulting in false negatives (Hou et al. 2011). However, 345 unigenes longer than 1000 bp also had no BLAST matches. For these unigenes, the lack of BLAST matches cannot be attributed to short sequence length and instead reflects a genuine lack of similar sequences in the databases.

In this study, many unigenes matching known proteins were assigned to a wide range of GO categories and COG classifications (Supplementary Fig. S2, S3), suggesting that our sequencing data represent a wide diversity of transcripts in N. nidus. Among the GO categories, cell part and binding were the largest subcategories in the cellular component and molecular function categories, respectively. This result is consistent with findings in litchi, sweet potato, and alfalfa (Wang et al. 2010; Li et al. 2013). In the COG classification, the second and third largest classes were replication, recombination and repair, and transcription, respectively, which is consistent with other studies (Wang et al. 2010; Wei et al. 2011; Li et al. 2013). Based on the KEGG pathway database, most representative unigenes were mapped to specific pathways, such as metabolism pathways, biosynthesis of secondary metabolites, transport and catabolism, translation, and plant hormone signal transduction. The GO, COG, and KEGG classifications of the N. nidus data set revealed that the assembled unigenes had diverse molecular functions and were involved in many metabolic pathways, indicating the diversity of the assembled unigenes while reflecting the global landscape of the transcriptome. Many of the novel unigenes might represent N. nidus-specific genes. In summary, the large number of genes we detected should provide sufficient transcriptomic sequence information to increase our understanding of the processes regulating growth and development in N. nidus.

SSR markers are important tools for the analysis of genetic diversity, comparative genetics, phylogenetic relationships, linkage mapping, QTL analysis, and marker-assisted selection (Saha et al. 2006; Cavagnaro et al. 2010). The development of SSR markers from the N. nidus transcriptome should therefore be useful for genetic studies and breeding applications. In our study, a total of 5792 potential EST-SSRs were identified from assembled sequences generated by Illumina paired-end sequencing, and 11.71 % of unigene sequences possessed SSRs. The EST-SSR frequency was 1 per 4.89 kb (5792 SSRs in 28.29 Mb), which is similar to that in poplar (1/4 kb) and radish (1/3.4 kb) (Tuskan et al. 2004; Wang et al. 2012a, b), much higher than that in rice (1/40 kb) and mungbean (1/67 kb) (Temnykh et al. 2001; Tangphatsornruang et al. 2009), and lower than that in cucumber (1/1.8 kb) and tea (1/2.4 kb) (Cavagnaro et al. 2010; Tan et al. 2013). The EST-SSR frequency depends on several factors, such as genome structure, the size of the sequence data set analyzed, the mining tool used, and the parameters used to search for SSRs (Gupta et al. 2003).
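The two frequency measures quoted above can be re-derived from the reported totals with a few lines of Python. This is only a worked arithmetic check; the variable and function names are hypothetical, and 1 Mb is taken as 1000 kb.

```python
def percent_with_ssr(ssr_unigenes, total_unigenes):
    """Percentage of unigenes that contain at least one SSR."""
    return 100 * ssr_unigenes / total_unigenes


def kb_per_ssr(searched_mb, total_ssrs):
    """Average spacing between SSRs in kb, taking 1 Mb = 1000 kb."""
    return searched_mb * 1000 / total_ssrs


# Totals as reported in the text.
share = percent_with_ssr(4823, 41173)   # SSR-containing unigenes / all unigenes
spacing = kb_per_ssr(28.29, 5792)       # searched sequence / number of SSRs
```

With these inputs, `share` rounds to 11.71 % and `spacing` comes out just under 4.9 kb, in line with the reported frequency of roughly one SSR per 4.89 kb given that the 28.29 Mb total is itself rounded.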

Dinucleotide repeats were the most abundant motif type in our study, followed by trinucleotide and tetranucleotide repeats, which is consistent with previous reports (Wei et al. 2011; Li et al. 2012; Tan et al. 2013). The most abundant di- and trinucleotide motifs were CT/GA and ACC/TGG, respectively (Fig. 3). This finding is consistent with the results reported for other plant species (Kantety et al. 2002; Kumpatla and Mukhopadhyay 2005; La et al. 2005; Wang et al. 2010; Li et al. 2012; Kumar et al. 2014). It has been speculated that CT repeats are typically found in transcribed regions and occur at a high frequency in 5′-UTRs, and that they may be involved in antisense transcription and play a role in gene regulation (Martienssen and Colot 2001; Wang et al. 2012a, b). GA-rich repeats were the most abundant dinucleotide motif in our study, whereas TA-rich repeats were the most frequent in poplar, sorghum, and mungbean sequences (Tuskan et al. 2004; Tangphatsornruang et al. 2009; Yonemaru et al. 2009). GC-rich motifs made up the smallest proportion of the dinucleotide repeats, which is in agreement with previous studies (Wei et al. 2011; Yu et al. 2011; Zhu et al. 2012). The other motifs appeared to be evenly distributed, each with a frequency lower than 1 %, except for AAG/TTC, ACC/TGG, ACT/TGA, AGG/TCC, AGT/TCA, ATC/TAG, CAG/GTC, CCT/GGA, CGA/GCT, and CTC/GAG among the trinucleotide repeats. Differences in SSR abundance are often seen among studies of different plant species. However, some motifs, such as AT/TA and AG/TC, are more abundant in most plants (Morgante et al. 2002).

The majority of EST-SSR primers generated high-quality amplicons, suggesting that EST-SSRs are suitable for specific primer design. In this study, 100 pairs of PCR primers were designed and used to assess assembly quality, and 79 primer pairs (79 %) successfully amplified PCR fragments. The failure of 21 primer pairs to amplify may have been caused by large introns, primers spanning splice sites, chimeric primers, or poor-quality sequences (Varshney et al. 2005). Of the 79 working primer pairs, 65 amplified PCR products of the expected size. The deviation of 14 primer pairs from the expected size may be due to the presence of introns in the amplicons, large insertions or repeat number variations, or a lack of specificity (Saha et al. 2006). The possibility of assembly errors also cannot be ruled out. These PCR results support the quality of our assembled unigenes and demonstrate the potential utility of the EST-SSRs produced in this study. The large number of EST-SSRs identified from our transcriptome data can be used for population genetics, genetic diversity, linkage mapping, comparative genomics, and other genetic studies of N. nidus. Overall, the results suggest that the assembled unigenes were of high quality and that the EST-SSRs identified in our data set will provide a wealth of resources for developing EST-SSR markers in N. nidus.