Introduction

After fish, molluscs are the most important group of animals contributing to world aquaculture production. However, they have received far less attention in terms of genetic improvement programs, the characterisation and description of genetic variability, and genetic-based management programs [1]. Microsatellites have been the molecular markers of choice for estimating genetic variability in natural and domesticated populations of animals and plants [2, 3]. Despite their popularity and utility, a major limitation to the use of microsatellites for non-model species is that, in general, loci first have to be identified and developed specifically for that species, which is a time- and resource-consuming process [2, 4]. However, next-generation sequencing (NGS) platforms are increasingly being used to support microsatellite marker identification and development [5–7], greatly decreasing the time and cost of this tedious process [8]. In addition, NGS approaches facilitate the identification of loci suitable for multiplexing, further increasing the efficiency of genotyping [2], which is itself now being undertaken using NGS platforms [9, 10].

In addition to their growing importance for aquaculture, clams are important components of near-shore soft sediment environments and many are of significant commercial value as fisheries. The family Mactridae contains a diverse group of approximately 350 species of ecologically and commercially important clams with a global distribution [11, 12]. Species live by burrowing into sandy or gravelly substrates usually below the low water mark and there is increasing evidence of cryptic speciation in the group [12, 13]. In many countries, especially in the tropics, knowledge of genetic structure of species is lacking due to limited genetic resources [1].

Species of snout otter clams placed within the genus Lutraria are sought after by commercial fishers and aquaculture methods are increasingly being developed in Asian countries for several species [11, 14]. The species Lutraria rhynchaena is becoming an important aquaculture species in countries such as Vietnam [11]. However, there is a lack of genetic resources available for this species and the genus more generally [12, 15, 16], which constrains the effective development of stocks for aquaculture and the development of management plans for translocation of broodstock and seedstock.

Panels of microsatellite loci have been developed for a number of commercially or ecologically important bivalve species using both traditional approaches [17, 18] and more recently, NGS-based approaches [16, 19].

The objectives of our study were to: (1) sequence and assemble the partial genome of the commercially important snout otter clam Lutraria rhynchaena, (2) identify and validate microsatellite loci from the assembled partial genome using amplicon-based next-generation sequencing, and (3) assess the potential of the identified microsatellite loci for population genetic studies.

Materials and methods

Sampling

A total of 104 clam samples representing wild and cultured populations from north and central Vietnam were used in this study (Table 1).

Table 1 Collecting site, population type, population codes and number of clam samples used in this study

Partial genome sequencing

Approximately 1 µg of genomic DNA was extracted from a muscle sample of a clam from Van Don Province (VD) using the DNeasy Blood and Tissue Kit (Qiagen, Hilden, Germany). The purified genomic DNA was quantified with Qubit HS (Invitrogen, USA), normalized to 2 ng/μL and subsequently processed using Nextera-based library preparation (Illumina, San Diego, CA) following the manufacturer’s instructions. Quantification and size estimation of the library were performed on a Bioanalyzer 2100 High Sensitivity DNA chip (Agilent, Santa Clara, CA). Next, the library was normalized to 2 nM and sequenced on the MiSeq Benchtop Sequencer (2 × 250 bp paired-end reads) (Illumina, USA). The reads were assembled de novo into contigs using IDBA-UD (settings: --mink 31 --maxk 251) [20].
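The normalisation of the library to 2 nM described above rests on a standard conversion from mass concentration to molarity. The following is a minimal sketch of that arithmetic, assuming the common approximation of 660 g/mol per base pair of double-stranded DNA; the concentration and mean fragment length used in the example are hypothetical and are not values measured in this study.

```python
# Approximate molar mass of one base pair of double-stranded DNA (g/mol).
AVG_MASS_PER_BP = 660.0


def library_molarity_nm(conc_ng_per_ul, mean_fragment_bp):
    """Convert a library concentration (ng/uL) to molarity (nM).

    nM = (ng/uL * 1e6) / (660 g/mol/bp * mean fragment length in bp)
    """
    if mean_fragment_bp <= 0:
        raise ValueError("mean_fragment_bp must be positive")
    return conc_ng_per_ul * 1e6 / (AVG_MASS_PER_BP * mean_fragment_bp)


def dilution_factor(current_nm, target_nm=2.0):
    """Fold dilution needed to bring a library down to the target molarity."""
    if current_nm < target_nm:
        raise ValueError("library is already below the target molarity")
    return current_nm / target_nm


# Hypothetical library: 1.5 ng/uL with a mean fragment size of 600 bp.
molarity = library_molarity_nm(1.5, 600)
print(f"{molarity:.2f} nM, dilute {dilution_factor(molarity):.2f}-fold to reach 2 nM")
```

In practice the mean fragment length would come from the Bioanalyzer trace and the concentration from the fluorometric or qPCR-based quantification.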

Microsatellite isolation and characterization

The open-source QDD version 3 [21] was used to identify contigs possessing microsatellite motifs and to design primer pairs suitable for the amplification of these loci. Primers were subsequently filtered based on suggestions by the authors of the software [21]. A selection of contigs including di-, tri-, and tetra-nucleotide repeats was used for subsequent analysis. A total of 48 loci were initially screened for amplification success and for the presence of polymorphism using template DNA from eight clams, representing four sampling locations from north and central Vietnam. Primers were pooled for the co-amplification of suitable loci by multiplex PCR using a QIAGEN multiplex kit and an Eppendorf MastercyclerS gradient PCR machine following the protocol described by Blacket et al. [22]. Illumina adapters were attached to the purified amplicons using the NEXTflex DNA preparation kit (Bioo Scientific, Austin, TX).
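To illustrate what identifying contigs with di-, tri- and tetra-nucleotide repeats involves, the sketch below scans a sequence for perfect tandem repeats using a regular expression. It is not QDD's algorithm and does not design primers; the minimum repeat count of five and the toy contig are assumptions made purely for illustration.

```python
import re

# Assumed threshold: a repeat unit must occur at least this many times in tandem.
MIN_REPEATS = 5

# A 2-4 bp unit (shortest unit tried first) followed by further copies of itself.
SSR_PATTERN = re.compile(r"([ACGT]{2,4}?)\1{%d,}" % (MIN_REPEATS - 1))


def find_microsatellites(sequence):
    """Return (start, repeat_unit, copy_number) for perfect di- to tetra-nucleotide repeats."""
    hits = []
    for match in SSR_PATTERN.finditer(sequence.upper()):
        unit = match.group(1)
        if len(set(unit)) == 1:
            # A unit such as "AA" is a homopolymer run, not a microsatellite motif.
            continue
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Hypothetical contig containing an (AC)6 and a (GAT)5 repeat.
toy_contig = "TTGG" + "AC" * 6 + "GGTA" + "GAT" * 5 + "CC"
for start, unit, copies in find_microsatellites(toy_contig):
    print(f"position {start}: ({unit}){copies}")
```

Dedicated tools additionally handle compound and imperfect repeats and check the flanking sequence for primer design, which this sketch deliberately leaves out.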

Preliminary sequencing results identified a subset of 12 suitable polymorphic loci. The multiplex primers were re-designed to contain a partial Illumina adapter, allowing for rapid and economical library construction using a 2-step PCR method. Briefly, a multiplex PCR was performed on each clam sample and the amplicons were purified and size-selected using AMPure XP beads (0.9 × volume ratio). A second PCR was carried out using Illumina Nextera-based barcode primers to generate amplicons containing the complete Illumina adapter and a unique barcode. After the second PCR, the amplicons were purified using AMPure XP beads (0.8 × volume ratio), quantified using the KAPA Library Quantification kit (KAPA Biosystems, Cape Town, South Africa), normalized, pooled and sequenced on the MiSeq (2 × 250 bp paired-end run).

Raw sequence quality control and bioinformatics processing

The raw paired-end reads were adapter-trimmed using Trimmomatic [23] and overlapped using PEAR (settings: -q 15, -m 150) [24]. The processed reads were then filtered with a grep command to retain only those containing both the complete forward and reverse primer sequences. Next, the reads were mapped to the assembled contigs using Bowtie2 (with the --very-sensitive option) [25]. Variation in amplicon length attributable to the repeat motif was genotyped from the alignments in Geneious v.7.0.4 [26] by summarising the read length distributions for each locus in frequency histograms and applying established criteria for genotyping microsatellite loci (see pages 599 and 603 in [2]).
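The primer filtering and length-histogram steps described above can be sketched in a few lines. This is an illustrative simplification rather than the pipeline used here: it assumes merged reads are oriented in the forward direction, and the calling rule (a hypothetical het_ratio threshold on the second most frequent length) stands in for the published genotyping criteria in [2], which also account for stutter.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_both_primers(read, fwd_primer, rev_primer):
    """True if a forward-oriented merged read contains both complete primers."""
    return fwd_primer in read and reverse_complement(rev_primer) in read


def call_genotype(read_lengths, het_ratio=0.5):
    """Call a diploid genotype from the read length distribution of one locus.

    The most frequent length is always called. The second most frequent
    length is called as a second allele only if its read count is at least
    het_ratio times that of the first (hypothetical threshold).
    """
    ranked = Counter(read_lengths).most_common(2)
    if not ranked:
        return None
    first_len, first_n = ranked[0]
    if len(ranked) == 2 and ranked[1][1] >= het_ratio * first_n:
        return tuple(sorted((first_len, ranked[1][0])))
    return (first_len, first_len)


# Hypothetical read lengths for one locus in two individuals.
print(call_genotype([150] * 40 + [154] * 35 + [148] * 5))  # two well-supported lengths
print(call_genotype([150] * 80 + [148] * 6))               # one dominant length
```

Inspecting the full histogram, as was done in Geneious, remains important because stutter peaks and off-target amplification can mimic a second allele.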

The software GenAlEx (http://biology-assets.anu.edu.au/GenAlEx/Welcome.html) was then used to estimate expected (HE) and observed (HO) heterozygosities and the number of alleles (NA), while conformity to Hardy–Weinberg equilibrium (HWE) expectations, the inbreeding coefficient (FIS) and linkage disequilibrium between all pairs of loci were examined using the open-source GENEPOP on the web v4 [27]. Bonferroni corrections [28] were used to adjust significance values for multiple comparisons. Lastly, loci were assessed for null alleles and scoring errors using MICRO-CHECKER [29]. Pairwise comparisons of allelic frequencies between populations VDT (north Vietnam) and NTT (central Vietnam) were carried out using the G-test of independence, also implemented in GENEPOP [27].
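For readers unfamiliar with these summary statistics, the sketch below computes NA, HO, HE and FIS for a single locus from diploid genotypes using the textbook definitions (HE = 1 − Σp², FIS = 1 − HO/HE). These are not necessarily the exact estimators implemented in GenAlEx or GENEPOP, which may apply sample-size corrections, and the genotypes shown are invented for illustration.

```python
from collections import Counter


def locus_summary(genotypes):
    """Summarise one locus from a list of diploid genotypes (allele_a, allele_b)."""
    n = len(genotypes)
    if n == 0:
        raise ValueError("at least one genotype is required")

    # Allele frequencies from the 2n gene copies sampled.
    allele_counts = Counter(allele for genotype in genotypes for allele in genotype)
    total_copies = 2 * n

    # Expected heterozygosity under Hardy-Weinberg proportions.
    he = 1.0 - sum((count / total_copies) ** 2 for count in allele_counts.values())
    # Observed heterozygosity: fraction of individuals carrying two different alleles.
    ho = sum(1 for a, b in genotypes if a != b) / n
    # Positive FIS indicates a deficit of heterozygotes relative to expectation.
    fis = 1.0 - ho / he if he > 0 else float("nan")

    return {"NA": len(allele_counts), "HO": ho, "HE": he, "FIS": fis}


# Hypothetical sample of ten individuals with alleles scored as amplicon lengths.
sample = [(150, 150)] * 4 + [(150, 154)] * 2 + [(154, 154)] * 4
print(locus_summary(sample))
```

A strongly positive FIS at a single locus, as in this toy sample, is the pattern that typically prompts a check for null alleles.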

Results and discussion

Next-generation sequencing and de novo genome assembly

A total of 4,214,000 paired-end genomic reads comprising 622.5 Mb of data were obtained from the library using the MiSeq platform. These reads were submitted to the Sequence Read Archive [SRA: ERR955910]. Assembly of these reads produced 22,193 contigs with an N50 of 661 bp.
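The N50 reported above is the contig length at which contigs of that length or longer together contain at least half of the assembled bases. A minimal sketch of the calculation follows; the contig lengths in the example are hypothetical and are not taken from the assembly.

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given its contig lengths in bp."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("contig_lengths must not be empty")

    half_total = sum(lengths) / 2
    running_total = 0
    # Walk from the longest contig down until half of all bases are covered.
    for length in lengths:
        running_total += length
        if running_total >= half_total:
            return length


# Hypothetical toy assembly of six contigs.
toy_lengths = [1000, 800, 661, 500, 300, 200]
print(n50(toy_lengths))
```

Because N50 depends only on the length distribution, it says nothing about assembly correctness, which is why it is reported here alongside the contig count.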

Microsatellite isolation and characterisation

Using the assembled contigs, a total of 916 contigs possessing microsatellite motifs were identified by QDD analysis, of which 48 contained priming sites for loci with an estimated melting temperature of approximately 58 °C. A subset of 12 of these loci was then selected for further genotyping based on the degree of polymorphism and the strength and consistency of amplification in an initial set of eight samples. The details of these loci and their accession numbers are provided in Table 2.

Table 2 Primer sequence and characteristics of 12 microsatellite loci developed for L. rhynchaena and amplified in a single multiplex PCR reaction

The set of loci was found to have low to moderate genetic variation, with an average of 2.6 alleles per locus (range = 2–4 alleles) and heterozygosity estimates ranging between 0.14 and 0.77 (mean = 0.34; Table 2). There was no strong evidence of linkage between pairs of loci: only one pairwise comparison was significant, and this became non-significant after Bonferroni correction. With the exception of loci mumLr11 and mumLr12, all loci conformed to Hardy–Weinberg expectations. These two loci remained significant after Bonferroni adjustment and showed an excess of homozygotes, as indicated by high FIS values (Table 2). In addition, the primers for locus mumLr12 may have been amplifying a second locus in some individuals, as three and sometimes four different length variants were apparent at high frequencies. Analysis using MICRO-CHECKER detected the presence of null alleles at these same two loci and also at locus mumLr8, which was marginally non-significant for HWE without Bonferroni correction (p = 0.0518). All loci amplified strongly in two of the three additional populations examined (n = 7–36). Only a few individuals and loci were scorable in clams from site CTT, which also possessed divergent mitochondrial COI haplotypes (unpublished data), suggesting the presence of a cryptic species at this location. Pairwise comparisons of allelic frequencies between populations VDT (north Vietnam) and NTT (central Vietnam) for the 10 loci giving non-significant HWE results produced four significant results (Table 2), with two comparisons giving P-values <0.001. Thus, these loci have the potential to be used to address questions relating to population differentiation and gene flow in this species.
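The effect of the Bonferroni correction on the linkage and HWE tests can be made concrete with a short calculation. The sketch below assumes a nominal significance level of 0.05 and treats all tests within one population as a single family; neither the level nor the family size is stated explicitly in the text, so both are assumptions.

```python
from math import comb


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance threshold after Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


N_LOCI = 12           # loci in the multiplex panel
NOMINAL_ALPHA = 0.05  # assumed nominal significance level

# Every unordered pair of loci is tested once for linkage disequilibrium.
n_pairs = comb(N_LOCI, 2)

print(f"linkage tests: {n_pairs}, threshold {bonferroni_alpha(NOMINAL_ALPHA, n_pairs):.6f}")
print(f"HWE tests: {N_LOCI}, threshold {bonferroni_alpha(NOMINAL_ALPHA, N_LOCI):.5f}")
```

Under these assumptions, a single pairwise linkage test that is significant at the nominal level but not at the corrected threshold is consistent with what would be expected by chance among 66 comparisons.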

The allelic diversity and heterozygosity in L. rhynchaena are lower than in a number of other bivalve mollusc species [18, 30]. This may reflect the small number of loci sampled, an inherent characteristic of the species, or the population genetic history of the sample, which could have suffered a population bottleneck or be the result of human translocation. It is noteworthy that a number of recent studies have reported reduced allelic variation in clam species [31, 32], including one on the surf clam species, Mactra chinensis, from the same family [16]. Only 2 of the 12 loci showed deviation from HWE, which is a lower proportion than in many similar studies of bivalve molluscs [18, 33]. In general, applying NGS to any given species enables the identification of hundreds or sometimes thousands (depending on sequencing depth) of candidate microsatellite loci. A high number of candidate loci ensures that sufficient loci are retained after strict quality filtering for criteria such as pure microsatellite motifs, non-hairpin-forming sequences, compatible primer annealing temperatures for multiplexing, and few or no hits to transposable elements. Further, the use of NGS platforms for subsequent genotyping not only removes the limitations associated with fragment analysis but also allows for more accurate genotyping, given that amplicon length estimation is based on absolute quantification (base-by-base) instead of relative quantification based on a size standard. Additionally, the ability to differentiate loci bioinformatically allows loci with overlapping sizes to be amplified in a single PCR reaction, thereby reducing laboratory cost and time [18, 33].

Conclusion

In this study we successfully identified over 900 potential microsatellite loci for L. rhynchaena from contigs assembled from 622.5 Mb of NGS data using the Illumina MiSeq platform. From these contigs, we were able to rapidly identify and characterise 12 polymorphic microsatellite markers suitable for multiplexing with all but two conforming to HWE expectations. Based on pairwise statistical tests, we showed that the microsatellite loci could effectively differentiate clam populations in Vietnam. Thus, we also demonstrated the feasibility of using NGS platforms as a faster, more accurate and potentially cheaper approach for genotyping.