Introduction

Microsatellites, also known as simple sequence repeats (SSRs), are tandem repeats of one to six nucleotides in DNA sequences (Oliveira et al. 2006). Given their extensive distribution in genomes, high polymorphism, codominant inheritance and high amplification success rate, microsatellites have been one of the most powerful and valuable molecular tools in many research areas, such as population genetics, conservation genetics, genome mapping, parentage analysis, and quantitative trait loci identification (Chang et al. 2009; Montanari et al. 2016; Xue et al. 2014). Despite advances in the acquisition of single nucleotide polymorphism (SNP) data, microsatellites remain useful and more easily accessible for many studies, such as long-term monitoring of genetic diversity in stock management, and breeding or pedigree estimation (Hodel et al. 2016; Minegishi et al. 2015; Stabile et al. 2016; Weinman et al. 2015; Zalapa et al. 2012). A major limitation to the use of microsatellites is that traditional methods for microsatellite development, such as construction of an enriched library followed by cloning and Sanger sequencing, are labor-intensive, time-consuming, and expensive (Glenn and Schable 2005; Zane et al. 2002). Furthermore, microsatellites have to be developed de novo for every species under study, as cross-amplification from congeneric species is not generally feasible (Schoebel et al. 2013). Therefore, rapid and cost-effective methods for microsatellite development are urgently needed for population management and conservation of non-model species.

Next-generation sequencing (NGS) can produce large amounts of sequence data, from which numerous genome-wide and gene-based microsatellite loci can be isolated and developed (Zalapa et al. 2012). So far, studies using NGS to develop microsatellite loci have largely relied on the Roche 454 and Illumina sequencing platforms (Hodel et al. 2016; Minegishi et al. 2015; Schoebel et al. 2013; Zalapa et al. 2012). Since read length is an important factor affecting the likelihood of discovering microsatellites and designing primers, the 454 sequencing platform was used extensively for microsatellite development (Hodel et al. 2016; Mastretta-Yanes et al. 2015). However, 454 is less cost-effective than Illumina on a per-megabase basis (Glenn 2011; Zalapa et al. 2012). Moreover, Roche discontinued the 454 instrument in 2016. Currently, microsatellite discovery projects largely focus on Illumina platforms. However, the short read lengths obtained with Illumina platforms limit their utility for microsatellite development, because most reads do not contain enough flanking sequence for primer design. To improve the efficiency of microsatellite development using Illumina reads, sequence assembly should be performed to create longer DNA sequences or contigs before microsatellite discovery.

One cost-efficient and practical strategy to develop microsatellite markers using NGS technologies is the sequencing of a reduced representation genomic library (Bonatelli et al. 2015). Restriction site-associated DNA sequencing (RAD-seq) is a useful approach to create reduced representation genomic libraries and provide sequence data adjacent to restriction enzyme recognition sites (Davey et al. 2011; Hohenlohe et al. 2013). RAD-seq incorporates a random shearing step in library preparation, which can be modified to generate overlapping paired reads. The reads of a single RAD locus generated by traditional RAD-seq technology can first be clustered using the similarity of the first reads, which carry the restriction enzyme recognition site. The overlapping paired-end reads then allow local assembly of contigs containing both the forward and reverse reads of each pair (Hohenlohe et al. 2013), which can improve the accuracy and quality of the assembled contigs and therefore the success rate of microsatellite development.

The roughskin sculpin Trachidermus fasciatus Heckel (Scorpaeniformes: Cottidae) is a small, benthic, carnivorous, and catadromous fish species native to the Northwestern Pacific (Onikura et al. 2002; Wang 1999). In the past decades, it has experienced severe population declines in China, probably due to habitat degradation, water pollution and dam construction (Wang and Cheng 2010). However, only a few molecular genetic resources are publicly available for the roughskin sculpin (Xu et al. 2008; Zeng et al. 2012), and the use of microsatellite markers in conservation genetic studies and marker-assisted selection has been limited (Li et al. 2016b).

In the present study, a “RAD-seq-Assembly-Microsatellite” approach was developed and applied to the roughskin sculpin as a representative of non-model species for which limited genetic data are available. To improve the success rate of microsatellite development in a simple, fast, and economical way, this approach combines the advantages of traditional RAD-seq technology with fast and efficient bioinformatic tools for read assembly, microsatellite isolation and primer design. The essence of the approach is to generate sufficiently long, high-quality contiguous sequences to overcome the technical limitations introduced by short read lengths and to isolate abundant microsatellite loci. Briefly, genomic DNA of a roughskin sculpin individual was sequenced using the overlapping paired-end RAD-seq protocol, and the generated reads were sorted according to RAD loci and locally assembled to achieve longer contiguous sequences. The assembled contigs were then searched for microsatellite sequences, and primer pairs were designed. Finally, 52 microsatellite loci were randomly selected and validated in two roughskin sculpin populations based on PCR amplification and genotyping. The newly developed rapid and cost-effective approach should be of particular advantage for the isolation and characterization of sufficient microsatellite loci for ecological and evolutionary studies of non-model species.

Materials and methods

Sampling and genomic DNA extraction

A total of 48 individuals were collected from two geographic sites in China: 24 individuals from Dandong, Liaoning Province (39°46′N, 124°20′E) in May 2014, and 24 individuals from Fuyang, Zhejiang Province (30°03′N, 119°58′E) in January 2014. Muscle tissues were preserved in 95% ethanol. Genomic DNA was extracted using the standard phenol–chloroform extraction protocol, and checked using 1% agarose gel electrophoresis and a NanoDrop 2000c spectrophotometer.

Library preparation and RAD tag sequencing

Approximately 1 μg of genomic DNA extracted from a single individual from Fuyang was digested with the restriction enzyme EcoRI. The digested products were ligated to a modified Illumina P1 adapter containing an individual-specific 6 bp index sequence for sample tracking. The DNA sample was then randomly sheared to an average size of 500 bp, and fragments with insert sizes spanning 200–600 bp were isolated using a MinElute Gel Extraction kit (Qiagen). An “A” base overhang was added to the 3′ ends of the blunt DNA fragments, and then a modified P2 adapter containing a 3′ dT overhang was ligated onto the ends of DNA fragments with 3′ dA overhangs. Finally, the library was enriched by high-fidelity PCR amplification, yielding RAD tags that contain both adapters for paired-end (2 × 125 bp) sequencing on an Illumina HiSeq 2500 platform at Novogene in Tianjin.

RAD data assembly and assessment

Illumina raw reads were quality-filtered, and PCR duplicates were removed by “clone_filter” in STACKS (version 1.32) (Catchen et al. 2013). The first reads, which carry the restriction enzyme recognition site, were passed to STACKS to identify RAD loci. The minimum depth of stacks was set to 10, and the number of mismatches allowed between stacks was set to 3 to separate true alleles from paralogues. Deleveraging and removal algorithms were turned on to filter out highly repetitive loci. Finally, the second reads corresponding to each RAD locus were collected into separate files using a modified version of “sort_read_pairs.pl” in STACKS. The reads for each locus were locally assembled by CAP3, a DNA sequence assembly program based on the overlap-layout-consensus method (Huang and Madan 1999). The assembly was performed by a custom multi-threaded Perl script, CP3_Opti.pl (available at https://github.com/lyl8086/RAD_SSR), according to an optimized three-step approach. Firstly, the second reads for each RAD locus identified by the first reads were locally assembled into contigs. Secondly, the assembled contigs of the second reads were merged with the corresponding consensus sequence of the first reads for each RAD locus. Thirdly, a final assembly was performed on each RAD locus to generate the final assembled RAD reference. In general, the overlapping paired reads generated by RAD-seq are staggered over a local genomic region, so they can be locally assembled into high-quality contigs of up to 1 kb, depending on the size selection strategy used in library preparation. The longer contigs thus provide sufficient sequence for downstream microsatellite discovery and primer design.
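
The per-locus collection of second reads that precedes local assembly can be illustrated with a minimal Python sketch. It is not the modified “sort_read_pairs.pl” or CP3_Opti.pl used in the study; the function names, the read-to-locus mapping and the “locus_<id>.fa” file naming are hypothetical, and the only assumption made about CAP3 is that it takes the input FASTA file as its first argument.

```python
from collections import defaultdict
from pathlib import Path


def write_locus_fastas(read_to_locus, second_reads, out_dir):
    """Write one FASTA file of second reads per RAD locus.

    read_to_locus: dict mapping read ID -> locus ID (hypothetical input,
                   e.g. derived from the clustering of the first reads)
    second_reads:  iterable of (read ID, sequence) for the second reads
    Returns a dict mapping locus ID -> path of the FASTA file written.
    """
    by_locus = defaultdict(list)
    for read_id, seq in second_reads:
        locus = read_to_locus.get(read_id)
        if locus is not None:  # skip pairs whose first read joined no locus
            by_locus[locus].append((read_id, seq))

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = {}
    for locus, reads in sorted(by_locus.items()):
        path = out_dir / f"locus_{locus}.fa"  # hypothetical naming scheme
        path.write_text("".join(f">{rid}\n{seq}\n" for rid, seq in reads))
        paths[locus] = path
    return paths


def cap3_command(fasta_path):
    """Build (but do not run) a CAP3 call with default options."""
    return ["cap3", str(fasta_path)]
```

Each returned command could then be passed to a process pool to assemble loci in parallel, which is the role the multi-threaded script plays in the workflow described above.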

To check the quality of the assembled RAD reference, the paired reads used for assembly were mapped back to the reference with BWA 0.7.12 (Li and Durbin 2009). BWA “mem” (Li 2013) was used to generate the SAM file, with default parameters except for a minimum seed length of 32. The SAM file was processed with SAMTOOLS 1.3.1 (Li et al. 2009) to check the overall coverage, the number of mapped reads and the depth. To further improve the quality of the assembled RAD reference, only contigs with properly mapped read pairs (paired reads mapped in the right orientation with a proper insert size as determined by the aligner) and a minimum mapping quality of 20 were retained. Soft- or hard-clipped reads, secondary alignments, and reads with the SAM tags “XA” or “SA” were also removed. The resulting high-quality contigs were then used for downstream microsatellite discovery and primer design.
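
The read-level criteria above can be expressed as a single predicate on a SAM record. The following Python sketch is an illustration under stated assumptions rather than the pipeline actually used: the function name is hypothetical, the flag bits follow the SAM specification, and the exclusion of supplementary alignments is an added assumption (such records normally carry an “SA” tag in any case).

```python
# SAM flag bits (as defined in the SAM specification)
PROPER_PAIR = 0x2
UNMAPPED = 0x4
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def keep_alignment(sam_line, min_mapq=20):
    """Return True if one SAM alignment line passes all read-level filters."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, mapq, cigar = int(fields[1]), int(fields[4]), fields[5]
    # Optional fields start at column 12 and look like TAG:TYPE:VALUE
    tags = {field.split(":", 1)[0] for field in fields[11:]}

    if (flag & UNMAPPED) or not (flag & PROPER_PAIR):
        return False  # unmapped, or not part of a properly mapped pair
    if flag & (SECONDARY | SUPPLEMENTARY):
        return False  # secondary (and, by assumption, supplementary) records
    if mapq < min_mapq:
        return False  # mapping quality below the threshold
    if "S" in cigar or "H" in cigar:
        return False  # soft- or hard-clipped alignment
    if tags & {"XA", "SA"}:
        return False  # alternative or chimeric hits reported by BWA
    return True
```

Contigs would then be retained according to the reads that pass this predicate; how reads are aggregated per contig is not specified in the text and is left out of the sketch.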

Microsatellites searching and primer design

QDD 3.1.2 (Meglécz et al. 2010) was chosen for microsatellite discovery and primer design. The program was run on a local Galaxy (Afgan et al. 2016) platform with default parameters. Microsatellites were defined as pure or compound tandem repeats of di- to hexa-nucleotide motifs with at least five uninterrupted repeats. To improve the success rate, primers were selected based on the following five criteria: (1) select one primer pair for each locus; (2) select pure microsatellites with a repeat number greater than 5; (3) select only primers in design category A; (4) remove primers with an alignment score greater than 10; and (5) select primers located away from the target microsatellite (>10 bp).
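
The microsatellite definition above (a di- to hexa-nucleotide motif with at least five uninterrupted repeats) can be illustrated with a regular expression. This Python sketch is not how QDD detects repeats; the function name is hypothetical, and it reports pure repeats only, without handling compound microsatellites.

```python
import re


def find_pure_ssrs(seq, min_repeats=5):
    """Return (start, motif, repeat count) for pure di- to hexa-nucleotide SSRs.

    The lazy quantifier makes the regex try the shortest motif first, so
    ACACACACAC is reported as (AC)5 rather than as a longer motif.
    """
    pattern = re.compile(r"([ACGT]{2,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        motif = match.group(1)
        if len(set(motif)) == 1:
            continue  # a run of one base (e.g. AAAAAAAAAA) is not di- to hexa-
        hits.append((match.start(), motif, len(match.group(0)) // len(motif)))
    return hits
```

For example, a contig with six AC repeats starting at position 4 (0-based) yields one hit with motif AC and a repeat count of 6, whereas a tract of only four repeats is ignored.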

Microsatellite genotyping and polymorphism survey

A total of 52 primer pairs were randomly selected for laboratory verification. Initial testing for PCR amplification used two individuals from Fuyang. An M13 tail (5′-GGAAACAGCTATGACCATG-3′) was added to the 5′ end of each forward primer. PCR amplifications were performed in a total volume of 10 μL containing 10 ng genomic DNA, 1× PCRmix (Dongsheng Biotech Co., China) and 0.2 μM of each primer, using the following cycling conditions: (1) an initial activation step of 5 min at 95 °C; (2) 35 cycles of denaturation at 95 °C for 20 s, annealing at 52 °C for 30 s and extension at 72 °C for 30 s; and (3) a final extension of 5 min at 72 °C. The PCR products were electrophoresed on a 1.5% agarose gel, and only primers that produced specific products were further evaluated using an initial set of eight individuals. For this step, PCR amplifications were carried out in a final volume of 10 μL containing 10 ng genomic DNA, 1× PCRmix (Dongsheng Biotech Co., China), 0.02 μM forward primer, 0.2 μM reverse primer, and 0.2 μM of M13-tail primer fluorescently labeled with FAM, HEX or TAMRA. The PCR amplification program was the same as described above. Fluorescently labeled PCR fragments were electrophoresed on an ABI 3730xl automated sequencer (Applied Biosystems) with the GS-500 size standard. Allele calling was performed using GeneMarker (SoftGenetics, State College, USA). The final scoring was manually checked to minimize genotyping errors. Finally, the polymorphism of the microsatellite loci that passed the above two steps was assessed in 48 individuals from Fuyang and Dandong.

Genetic diversity indices for each locus and population, including observed heterozygosity (HO), expected heterozygosity (HE) and polymorphism information content (PIC), were calculated using the Excel Microsatellite Toolkit (Park 2001). The number of alleles (Na), allelic richness (AR) and inbreeding coefficient (FIS) were calculated using FSTAT 2.9.3 (Goudet 2001). Deviations from Hardy–Weinberg equilibrium and genotypic linkage equilibrium were tested with Genepop 4.5.1 (Rousset 2008). Significance was estimated by the Markov chain Monte Carlo (MCMC) method (10,000 dememorization steps, 1000 batches of 10,000 iterations). Micro-checker 2.2.3 (van Oosterhout et al. 2004) was used to test for the presence of null alleles. A standard Bonferroni correction was applied to the significance levels of all the above tests.
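
For readers unfamiliar with these indices, the per-locus quantities can be sketched in a few lines of Python. This is an illustration only, not the code of the programs cited above: the function name is hypothetical, expected heterozygosity is assumed to be Nei's unbiased estimator (the text does not state which estimator the toolkit applies), and PIC follows the usual definition 1 − Σp_i² − Σ_{i<j} 2p_i²p_j².

```python
from collections import Counter


def locus_diversity(genotypes):
    """Diversity indices for one locus in one population.

    genotypes: list of (allele_1, allele_2) tuples for diploid individuals,
               with missing genotypes already removed.
    """
    n = len(genotypes)
    counts = Counter(allele for pair in genotypes for allele in pair)
    freqs = [count / (2 * n) for count in counts.values()]

    # Observed heterozygosity: proportion of heterozygous individuals
    h_obs = sum(a != b for a, b in genotypes) / n

    # Expected heterozygosity (assumed: Nei's unbiased estimator)
    homozygosity = sum(p * p for p in freqs)
    h_exp = (2 * n / (2 * n - 1)) * (1 - homozygosity)

    # Polymorphism information content
    cross = sum(
        2 * freqs[i] ** 2 * freqs[j] ** 2
        for i in range(len(freqs))
        for j in range(i + 1, len(freqs))
    )
    pic = 1 - homozygosity - cross

    return {"Na": len(counts), "HO": h_obs, "HE": h_exp, "PIC": pic}
```

With four individuals genotyped as 100/102, 100/100, 102/102 and 100/102, both alleles have a frequency of 0.5, giving Na = 2, HO = 0.5, HE = 4/7 and PIC = 0.375.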

Results

RAD sequencing, filtering and assembly

A total of 33.5 million raw paired reads were obtained, and 25.1 million clean paired reads were retained after quality filtering and removal of PCR duplicates. A total of 137,409 loci identified by STACKS were exported into separate FASTA files for local assembly. CAP3 assembled a total of 127,864 contigs with a mean length of 517 bp and an N50 of 543 bp. About 20.8 million reads could be mapped back to the assembled contigs, and 94% of these were properly paired. After retaining contigs with properly paired reads and a minimum mapping quality of 20, and removing clipped and other possibly spurious reads, a total of 121,750 contigs were retained as the final assembled RAD reference for microsatellite discovery (Online Resource 1). The final assembled RAD reference had a mean length of 522 bp and a GC content of 41.59%.
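
The N50 reported above is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembled length. A minimal Python sketch of that calculation follows; the function name is hypothetical, and it is not the tool used to produce the figures above.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("at least one contig length is required")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length  # first contig at which half the total is reached
```

For contigs of 2, 3, 4, 5 and 6 bp (20 bp in total), the two longest contigs sum to 11 bp, which exceeds half the total, so the N50 is 5.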

Microsatellite isolation and characterization

A total of 19,782 contigs possessing microsatellite motifs were identified. For the 16,497 contigs that contained priming sites for microsatellite loci, the types of microsatellites in the target region were variable. The number of pure microsatellites was 12,127, while the number of compound microsatellites was 3242. Finally, a total of 156,150 primer pairs were successfully designed (Online Resource 2). After selecting one primer pair for each locus, pure microsatellites, design category A, a PCR primer alignment score ≤10, and a minimum primer-to-target distance >10 bp, a total of 1854 primer pairs were retained (Table 1). These 1854 microsatellite motifs included 1536 di- (82.85%), 262 tri- (14.13%), 49 tetra- (2.64%), 5 penta- (0.27%) and 2 hexa- (0.11%) nucleotide repeats, with repeat numbers ranging from 5 to 41.

Table 1 Summary of the number of primer pairs designed from the 16,497 assembled RAD contigs for Trachidermus fasciatus

Of the 52 primer pairs randomly selected, 48 produced clear and specific amplification products of the expected size when screened by 1.5% agarose gel electrophoresis in two individuals, and were subsequently genotyped by capillary electrophoresis in eight individuals. Finally, a set of 45 microsatellite loci was used to evaluate polymorphism in 48 individuals from the two populations, and a total of 618 alleles were detected (Table 2). The number of alleles per locus ranged from 3 to 29, and the expected (HE) and observed (HO) heterozygosity ranged from 0.3510 to 0.9800 and from 0.2080 to 1.0000, respectively. The polymorphism information content (PIC) ranged from 0.3070 to 0.9590. No linkage disequilibrium was detected between microsatellite loci. Significant deviation from Hardy–Weinberg equilibrium was observed at four loci (tfa 28, tfa 32, tfa 36 and tfa 57), of which two (tfa 32 and tfa 57) were significant in both tested populations (Table 2). Analyses using Micro-checker indicated the presence of null alleles at the same four loci.

Table 2 Characterization of 45 microsatellite loci validated in populations of Trachidermus fasciatus from Fuyang and Dandong

Discussion

The present study, with the roughskin sculpin as a representative of non-model species, developed a rapid and cost-effective microsatellite identification approach (RAD-seq-Assembly-Microsatellite) using paired-end RAD-seq. This approach creates longer sequences for microsatellite discovery by assembling the overlapping paired-end RAD reads, which overcomes the short read lengths generated by the Illumina platform and improves microsatellite detection rates. The approach can efficiently generate a large set of polymorphic microsatellite markers for a wide range of applications, from population genetics and behavioral ecology to marker-based breeding programs, especially for non-model species.

Next-generation sequencing (NGS) technologies have enhanced our ability to obtain hundreds of microsatellites in a rapid and low-cost manner, and have dramatically accelerated the discovery of genomic information even in non-model species (Cai et al. 2013; Davey et al. 2011; Hu et al. 2016). Compared with traditional methods for microsatellite marker development (Zane et al. 2002), our approach, which is based on RAD-seq and de novo local assembly, is more efficient in terms of money and time. At current market prices, RAD library construction costs approximately $75, and a 125 bp × 2 paired-end sequencing run on the Illumina HiSeq 2500 platform to produce 1 Gb of sequence costs approximately $70. The Illumina HiSeq 4000 and X Ten platforms have much lower per-base sequencing costs and higher throughput than the HiSeq 2500 platform (http://www.illumina.com). The total costs for library construction and sequencing using our approach ($360 for 4 Gb of data) are therefore lower than those of traditional approaches (at least $800 for 100 sequences) (Zalapa et al. 2012). In general, traditional approaches require 2–4 weeks for DNA extraction, library construction, cloning, Sanger sequencing and primer design, whereas our approach requires only 1–2 weeks for DNA extraction, RAD library construction, Illumina sequencing and primer design. Moreover, our approach can identify thousands of microsatellites and then design primers for them in batch, while traditional approaches can usually find repeated units only in a relatively small pool of sequences (Zane et al. 2002).

Several studies have used next-generation sequencing technologies to develop microsatellite markers (Bonatelli et al. 2015; Castoe et al. 2010; Hung et al. 2016; Li et al. 2016a, b; Minegishi et al. 2015). The two major NGS platforms used for the discovery of microsatellites are Roche 454 and Illumina sequencing (Zalapa et al. 2012). The main advantage of 454 sequencing for microsatellite discovery is its read length of 350–600 bp, which allows the discovery and development of microsatellites even directly from the raw reads. Castoe et al. (2010) identified 14,612 microsatellite loci in 11.3% of the 128,773 Roche 454 shotgun reads, 4564 of which had flanking sequences suitable for primer design. Bonatelli et al. (2015) validated 22 (30.56%) polymorphic microsatellites out of 64 loci using double digest restriction site–associated DNA sequencing (ddRAD-seq) on a Roche 454 platform. Seventy-four projects using 454 sequencing reviewed in Hodel et al. (2016) yielded 8–91 polymorphic loci, with an average of 16 polymorphic loci and 4400 potential loci derived from an average of 139,418 reads. The main advantage of the Illumina platform for microsatellite discovery is its much higher throughput and lower cost compared with the 454 platform (Zalapa et al. 2012). Cai et al. (2013) assembled Illumina shotgun reads into a draft genome of Anisogramma anomala, and successfully amplified 214 (90.7%) microsatellite loci with specific products. Hung et al. (2016) applied Illumina shotgun sequencing to Apodemus semotus and mapped the obtained sequence reads against the genome of Mus musculus, then successfully amplified 44 (74.57%) of 59 microsatellite loci. Hu et al. (2016) used transcriptome data from Illumina RNA-Seq and found that 20 (31.75%) loci were successfully amplified and polymorphic. For microsatellite development studies using Illumina sequencing reviewed in Hodel et al. (2016), the average number of polymorphic microsatellite markers reported was 15, and the average number of potential loci per study was 15,539.

However, the Illumina platform generates relatively short reads (100–300 bp), so assembly is usually required to achieve longer contiguous sequences, which can provide sufficient flanking sequence for the design of primers to amplify the target microsatellite and reduce redundancy of closely linked microsatellites (Zalapa et al. 2012). Yang et al. (2016) identified only 650 microsatellite loci from 4.5 million RAD raw reads, and primer pairs were successfully designed for only 285 (43.84%) of them. In contrast, in the present study, 22,835 microsatellites were discovered in the 121,750 assembled contigs, and 156,150 primer pairs were designed for 16,497 (77.24%) loci containing microsatellites. Therefore, careful consideration should be given to the quality of assembly. RAD-seq is a family of genomic approaches that provide sequence data adjacent to restriction enzyme recognition sites (Davey et al. 2011; Hohenlohe et al. 2013). Furthermore, the overlapping paired-end reads produced by traditional RAD-seq technology allow local assembly of contigs containing both the forward and reverse reads of each pair. These RAD contigs are anchored at one end by the restriction enzyme recognition site and contain several hundred base pairs of continuous genomic sequence data (Hohenlohe et al. 2013). This assembly method for RAD holds several advantages compared with whole genome sequencing. Firstly, reads of a single RAD locus can be clustered before assembly using the similarity of the first reads, which carry the restriction enzyme recognition site, thereby reducing complexity and computational costs to a level affordable even for a desktop computer. Secondly, local assembly of the contigs can improve the quality of de novo assembly, and therefore improve the success rate of microsatellite development. Thirdly, the size selection in library preparation is flexible, which makes the length of contigs easy to customize. In our study, PCR amplifications were successful for 48 (92.31%) of the 52 randomly selected loci, a higher rate than those of the other approaches above. This indicates that the primer pairs in the database and the assembled RAD contigs were of high quality and that most of the primer pairs would amplify their targets. Compared with other approaches using next-generation sequencing, our assembly-based approach exhibits great advantages in developing thousands of microsatellites rapidly and accurately, especially for non-model species with little background genomic information.

In conclusion, the present study has contributed a detailed approach to rapidly and cost-effectively develop genome-wide microsatellite markers in non-model species with a high success rate. A total of 45 polymorphic loci were validated, which serves as a proof of concept that the “RAD-seq-Assembly-Microsatellite” approach can be successfully applied to a non-model species. The “RAD-seq-Assembly-Microsatellite” approach developed in the present study holds great promise for microsatellite development in future ecological and evolutionary studies of non-model species.