Introduction

Environmental DNA (eDNA) techniques are quickly becoming routine in conservation research and are increasingly viewed by management professionals as a potentially cost-saving alternative to traditional field techniques (Goldberg et al. 2016). Further incorporation of eDNA techniques into management requires continued development of the genetic database resources needed to support effective implementation. For freshwater fish species, a reasonably comprehensive reference database exists only for the barcoding region of the COI gene (Ward et al. 2009), which is publicly available through the Barcode of Life Data System (BOLD; Ratnasingham and Hebert 2007). While this database remains an invaluable tool, it is limited to a small portion of the overall mitochondrial genome. Designing species-specific qPCR primers or universal primers for metabarcoding, and achieving species-level taxonomic resolution, is not always possible using the COI gene. Other gene regions such as 12S ribosomal RNA (rRNA), 16S rRNA and CytB are also taxonomically discriminative (Miya et al. 2015; Olds et al. 2016; Evans et al. 2017) but lack comprehensive species representation in public databases. Continued enhancement of public databases with complete mitochondrial genome sequences would provide data on all 13 protein-coding genes, 2 rRNAs, 22 transfer RNAs (tRNAs) and the highly variable control region. Access to mitochondrial sequence data from additional species, and data that include increased geographic variation within a species, will provide greater flexibility in the design and application of eDNA techniques.

Currently, the design of species-specific qPCR detection assays (Farrington et al. 2015; Bronnenhuber and Wilson 2013) is constrained by the limited availability of sequence data outside of the COI barcoding region. Access to sequence data from across the mitochondrial genome provides greater potential for locating ideal primer and probe annealing sites that convey a high level of species discrimination. Metabarcoding (Hänfling et al. 2016; Olds et al. 2016) depends on locating conserved priming sites that bracket a taxonomically informative area of variable sequence. Such sequence characteristics do not readily occur in protein-coding genes; thus ‘universal’ primers often target the 12S or 16S rRNA region (Miya et al. 2015; Sarri et al. 2014). Additionally, metabarcoding relies on a database of high-quality reference sequences to assign a taxonomic identification to the recovered sequences.

The online resource MitoFish (Iwasaki et al. 2013) provides access to a database of complete and partial fish mitochondrial genomes. Despite collating all publicly available mitochondrial genomes of fish, MitoFish still lacks comprehensive species representation. Complete mitochondrial genome sequences are available for just 2744 (as of September 2019) of the 34,200 described fish species in FishBase (Froese and Pauly 2017), often with just a single genome representing each species. Ideally, databases should include representative genomes of all species and encompass multiple representatives of each species, including geographic variation to represent localized mutations in mitochondrial sequences that may occur across a species' range. Development of such a resource is a large undertaking but is possible with next-generation sequencing techniques. Genome skimming using shotgun data (Richter et al. 2015; Gan et al. 2014) and long-range PCR (Briscoe et al. 2013) are two approaches that provide a means to obtain mitochondrial genome data from a large number of individuals/species. Genome skimming is a PCR-independent method in which sequenced libraries are composed of approximately 99% nontarget nuclear DNA, which makes the process more expensive and more computationally demanding when the goal is a complete mitochondrial genome. Long-range PCR is used to enrich for mitochondrial DNA prior to sequencing. Highly enriched samples allow for greater levels of multiplexing, resulting in lower sequencing costs and a need for fewer computational resources. However, an upfront investment is necessary to develop the long-range primer sets and obtain the sequencing data necessary for their design. Here, we describe order- or family-specific primers for long-range PCR amplification for 6 orders and 9 families that we have utilized to sequence the mitochondrial genomes of 65 different species.

Methods

Tissue collection and DNA extraction

Tissues were collected during various sampling efforts over the past 2 years. When possible, whole fish voucher specimens were retained at the USFWS Northeast Fishery Center in Lamar, Pennsylvania. Fin clips were preserved in 95% ethanol and placed at −80 °C for long-term storage. Genomic DNA was extracted from fin clips or muscle tissue using the DNeasy® Blood and Tissue Kit (Qiagen, Inc., Germantown, MD, USA). For most species, tissue from multiple individuals was obtained. With the exception of Scaphirhynchus species, all species were field-identified by trained fisheries biologists. Scaphirhynchus species were identified using a suite of microsatellite loci (McQuown et al. 2000; Schrey et al. 2007; Tranah et al. 2004). Two heuristic steps were taken to ensure quality control of sequences submitted to GenBank. First, the COI barcoding region was used to confirm the field identification of each specimen using BOLD (Ratnasingham and Hebert 2007) and to verify sample integrity during processing. Second, a cluster analysis of all newly sequenced full-length mitochondrial genomes and reference genomes obtained from GenBank was used to screen for potentially chimeric genomes prior to submission.

Primer design and optimization

Complete mitochondrial genome sequences for each order or family group were downloaded from GenBank (Benson et al. 2013) and aligned using the MAFFT algorithm (Katoh and Standley 2013) in Geneious R10 (Kearse et al. 2012). In most cases, this included a non-redundant list of every species with an available NCBI RefSeq (O’Leary et al. 2016) sequence. The total number of available sequences used in long-range primer design alignments varied between taxonomic groups as follows: Acipenseridae/Polyodontidae (15), Clupeidae (13), Catostomidae (13), Cyprinidae (538), Centrarchidae/Percidae (21), Salmonidae (50), Ictaluridae (7). Highly conserved regions were identified visually and Primer3 (Untergasser et al. 2012) was run within Geneious R10 (Biomatters Ltd., Newark, NJ, USA) to locate suitable primer annealing sites. Primer annealing sites were generally located in the rRNA genes and tRNAs due to their higher levels of conservation within order or family groups. Primer sets were chosen to amplify the entire mitochondrial genome in four overlapping sections each with a length of 3000 to 7000 base pairs (Table 1). All primer pairs, except for those designed for Cyprinidae, were designed based on a 100% consensus of the aligned sequences to ensure primer specificity across all species within the targeted taxonomic group. Due to the sequence diversity within Cyprinidae, a 90% consensus threshold was used to identify primer locations that minimized the need for redundant bases. Redundant bases were used sparingly and avoided in the 3′ end of any primer. Primer design for Clupeidae was restricted to genera with representation in freshwater habitats due to primer design difficulty with broader family representation. Primer sets for each mitochondrial genome region were optimized by running temperature gradients and template concentration dilutions to identify optimal amplification conditions, assessed by product evaluation on 1.5% agarose gels (Table 1).
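The effect of the consensus threshold on candidate primer sites can be illustrated with a minimal Python sketch. This is not the Geneious/Primer3 workflow used here; the function names, the toy alignment and the minimum window length are assumptions made for illustration only. A column counts as conserved when its most common unambiguous base is shared by at least the threshold fraction of sequences, and runs of conserved columns are reported as candidate annealing windows.

```python
from collections import Counter

def conserved_columns(alignment, threshold=1.0):
    """Flag each alignment column as conserved (True) or not (False).

    A column is conserved when its most common base (A, C, G or T) occurs
    in at least `threshold` of the sequences. Gaps and ambiguity codes
    therefore count against conservation.
    """
    n_seqs = len(alignment)
    flags = []
    for column in zip(*alignment):
        counts = Counter(base.upper() for base in column)
        top = max(counts[b] for b in "ACGT")
        flags.append(top / n_seqs >= threshold)
    return flags

def conserved_windows(flags, min_len=20):
    """Yield half-open (start, end) runs of at least min_len conserved columns."""
    start = None
    for i, ok in enumerate(flags + [False]):  # sentinel closes the final run
        if ok and start is None:
            start = i
        elif not ok and start is not None:
            if i - start >= min_len:
                yield (start, i)
            start = None

# Hypothetical toy alignment: ten sequences, one of which differs at column 4.
toy = ["ACGTACGTAC"] * 9 + ["ACGTTCGTAC"]

strict = list(conserved_windows(conserved_columns(toy, 1.0), min_len=3))
relaxed = list(conserved_windows(conserved_columns(toy, 0.9), min_len=3))
print(strict)   # [(0, 4), (5, 10)]
print(relaxed)  # [(0, 10)]
```

In this toy case the 100% threshold splits the alignment at the single variable column, whereas the 90% threshold tolerates it. That is the trade-off described above for Cyprinidae: more candidate sites, at the cost of possible mismatches in some species.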

Table 1 Long-range primers used to amplify complete mitochondrial genomes in four overlapping regions

PCR and gel electrophoresis

Long-range PCR was used to amplify the complete mitochondrial genome in four overlapping sections (Fig. 1). Each 25 μl PCR was amplified with either Q5 Hot Start High-Fidelity Master Mix (New England Biolabs, Ipswich, MA, USA) or Kapa HiFi HotStart ReadyMix (Kapa Biosystems, Wilmington, MA, USA) following the manufacturer’s recommended concentrations. Reactions were run under the following conditions: enzyme activation for 2 min at 98 °C, followed by 35 cycles of 20 s denaturing at 98 °C, 20 s at primer annealing temperature (Table 1) and 3 min at 72 °C, followed by a final 7 min elongation at 72 °C. All PCR products were visualized on a 1.5% agarose gel to verify amplification success.
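As a quick check of instrument time, the cycling conditions above can be summed in a few lines of Python. This sketch counts hold times only; ramp times between steps depend on the thermocycler and are ignored, and the function name is an illustrative assumption.

```python
def program_seconds(initial_s, cycles, step_seconds, final_s):
    """Total hold time of a PCR program in seconds, ignoring ramp time."""
    return initial_s + cycles * sum(step_seconds) + final_s

total = program_seconds(
    initial_s=2 * 60,              # enzyme activation, 98 °C
    cycles=35,
    step_seconds=(20, 20, 3 * 60), # denature, anneal, extend
    final_s=7 * 60,                # final elongation, 72 °C
)
print(f"{total} s = {total / 60:.1f} min")  # 8240 s = 137.3 min
```

The program therefore occupies a thermocycler for a little over two and a quarter hours before ramping is taken into account.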

Fig. 1 Mitochondrial genome sequencing workflow. Voucher specimens and fin clip samples are used to obtain genomic DNA from fish. Independent long-range PCR reactions amplify the mitochondrial genome in four overlapping regions. Regions are pooled, purified and prepared for sequencing using the Illumina Nextera XT Prep Kit workflow. After Illumina sequencing, reads are demultiplexed and de novo assembled into a circular contig. The consensus sequence is extracted and annotated prior to final quality assurance and submission to GenBank

Illumina sequencing

Successful amplification products were quantified using a Qubit™ 2.0 fluorometer (Life Technologies, Carlsbad, CA, USA) and corresponding fragments from the same specimen were pooled in equimolar ratios. Pooled PCR amplicons were bead purified, fluorometrically quantified and diluted to a standard concentration of 0.2 ng/µl. DNA libraries were created using the Nextera XT Library Prep Kit following the manufacturer’s instructions (Illumina, Inc., San Diego, CA, USA). Bead-normalized libraries were pooled and sequenced using the MiSeq Reagent V2 Kit with 2 × 250 paired-end reads. Sequences were sorted into FASTQ files and trimmed to remove remaining adaptor and index sequences using the onboard Illumina FASTQ workflow.
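Because the four amplicons differ in length, equal mass does not mean equal molarity. The following Python sketch shows one way the equimolar pooling volumes could be calculated from fluorometric concentrations. It uses the standard approximation of 660 g/mol per base pair of double-stranded DNA; the region names, concentrations, fragment lengths and the 50 fmol target are hypothetical values, not measurements from this study.

```python
AVG_BP_MASS = 660.0  # g/mol per base pair of dsDNA (standard approximation)

def molarity_nM(conc_ng_per_ul, length_bp):
    """Convert a dsDNA concentration in ng/µl to nM (equivalently fmol/µl)."""
    return conc_ng_per_ul * 1e6 / (AVG_BP_MASS * length_bp)

def equimolar_volumes(amplicons, fmol_each=50.0):
    """Return the µl of each amplicon that contains `fmol_each` femtomoles.

    amplicons maps a name to (concentration in ng/µl, length in bp).
    """
    return {
        name: fmol_each / molarity_nM(conc, length)
        for name, (conc, length) in amplicons.items()
    }

# Hypothetical Qubit readings and amplicon lengths for one specimen.
amplicons = {
    "region_1": (33.0, 5000),
    "region_2": (6.6, 4000),
    "region_3": (19.8, 6000),
    "region_4": (9.9, 3000),
}

for name, volume in equimolar_volumes(amplicons).items():
    print(f"{name}: {volume:.1f} µl")
```

Each volume delivers the same number of molecules to the pool, so sequencing depth is not skewed toward the shorter or more concentrated fragments.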

Mitochondrial genome assembly

After trimming Illumina adaptor and index sequences, FASTQ files were uploaded into Geneious R10 (Kearse et al. 2012) for quality control and assembly. Low-quality (Q < 20) bases were trimmed from each end, and short reads (< 25 bp) and reads with an average read quality of Q20 or less were discarded. Reads were merged (merge rate = normal) before error correction and normalization (BBNorm version 37.25; error correction, default settings; normalization, target coverage = 60 and minimum depth = 6). The normalized merged reads were de novo assembled using the Geneious assembler under medium sensitivity with the circularize contigs function turned on. A maximum mismatch of 5% was allowed. Occasionally the full mitochondrial genome was not obtained as a single complete circular contig using normalized data. In these cases, a de novo assembly was done using all available merged reads. Consensus sequences were based on the majority base call for each nucleotide position. Gene annotations were mapped to new genomes in Geneious R10 (Kearse et al. 2012) using existing NCBI reference sequences of the same or closely related species as the source genome. All complete genomes were submitted to GenBank (Table 2). Mitochondrial genomes were aligned using MAFFT (Katoh and Standley 2013) and the maximum percent of base pair differences was calculated within each species.
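The read-filtering criteria given at the start of this section can be expressed as a short Python sketch. It is a simplified stand-in, not the trimming algorithm implemented in Geneious; the function names are assumptions, and Phred+33 quality encoding is assumed for the FASTQ input.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string (Phred+33 assumed) to integer scores."""
    return [ord(ch) - 33 for ch in quality_string]

def trim_and_filter(seq, quals, min_q=20, min_len=25):
    """Trim low-quality ends, then apply the length and mean-quality filters.

    Returns (trimmed_seq, trimmed_quals), or None if the read is discarded.
    """
    start, end = 0, len(seq)
    # Remove bases below min_q from the 5' end, then from the 3' end.
    while start < end and quals[start] < min_q:
        start += 1
    while end > start and quals[end - 1] < min_q:
        end -= 1
    seq, quals = seq[start:end], quals[start:end]

    # Discard reads that are empty or shorter than the minimum length.
    if not seq or len(seq) < min_len:
        return None
    # Discard reads whose average quality is Q20 or less.
    if sum(quals) / len(quals) <= min_q:
        return None
    return seq, quals

# Hypothetical 30 bp read: two poor bases at each end, good bases between.
read = "A" * 30
scores = [10, 10] + [30] * 26 + [5, 5]
kept = trim_and_filter(read, scores)
print(len(kept[0]))  # 26
```

Note that end trimming only removes terminal bases, so a read with good ends but a poor interior is still caught by the average-quality filter.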

Table 2 Sequenced mitochondrial genomes

Results

The novel order- and family-level primer sets presented here allowed for the successful amplification and sequencing of 205 mitochondrial genomes from 9 families of fish representing 65 species/subspecies, 28 of which were not available in GenBank at the time of submission. It was not uncommon for one of the four regions to fail to amplify under the initial PCR conditions. However, the majority of these failures could be corrected by adjusting the annealing temperature or template concentration, or by changing the DNA polymerase. Overall, amplification success across all primer sets was greater than 90%, with only a few species failing to amplify one or more of the four regions. All mitochondrial genomes assembled had a length (16,486–16,832 bp) and gene composition typical of most fish species (Satoh et al. 2016), including two rRNA genes, 13 protein-coding genes, 22 tRNAs and the highly variable displacement loop (control region). With the exception of ND6, all protein-coding and rRNA genes were coded on the heavy strand. Eight tRNA genes (tRNA-Ala, tRNA-Asn, tRNA-Cys, tRNA-Gln, tRNA-Glu, tRNA-Pro, tRNA-Ser, tRNA-Tyr) were coded on the light strand with the remaining 14 coded on the heavy strand.

Sequencing multiple mitochondrial genomes from the same species revealed varying levels of intraspecies genetic variation. The total intraspecies base pair composition differed by a maximum of 2.65% (black crappie) with an average of 0.38%. Seven species showed levels of base pair variation over 1.0%. Four of these (channel catfish, redbreast sunfish, golden shiner, black crappie) came from a geographically dispersed area ranging from South Carolina to New York to Michigan, while the emerald shiner samples all originated from New York waters. The remaining two species (round whitefish, Alaska; lake sturgeon, New York) each originated from a single geographic area, and their elevated level of variation was due to a variable number of tandem repeats in the displacement loop (Table 2).
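The within-species percentages reported here are maximum pairwise differences among aligned genomes. A minimal Python sketch of that calculation follows. How gaps were scored in the original analysis is not stated, so the treatment below (a gap opposite a base counts as a difference, and every alignment column counts toward the denominator) is an assumption, as are the function names and toy sequences.

```python
from itertools import combinations

def percent_difference(a, b):
    """Percent of aligned positions at which two sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must come from the same alignment")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return 100.0 * diffs / len(a)

def max_intraspecies_difference(aligned):
    """Largest pairwise percent difference among aligned sequences of one species."""
    return max(percent_difference(a, b) for a, b in combinations(aligned, 2))

# Hypothetical 20 bp alignment of three individuals from one species.
individuals = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",
    "TCGTACGTACGTACGTACGC",
]
print(max_intraspecies_difference(individuals))  # 10.0
```

Under this gap treatment, length variation such as a variable number of tandem repeats inflates the pairwise difference, which is consistent with the elevated values seen in round whitefish and lake sturgeon.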

Discussion

The primer sets presented here offer a way to obtain mitochondrial genomes from 9 families of freshwater and marine fish species/subspecies and cover approximately 60% of the estimated 1050 species native to North America (Lundberg et al. 2000). While truly universal primers may not be possible, amplification of novel genomes from species not available during initial primer design suggests a broad application of the primers within their respective taxonomic target group. Overall, primers performed as expected and amplified a range of families/genera within their targeted taxonomic group. Primers designed at the order level were successful when the order contained a limited number of families. In the case of Acipenseriformes, there are only two extant family groups with relatively limited species diversity. Primer sets occasionally failed to amplify one of the target regions. However, adjusting the annealing temperature or template concentration, or changing the type of DNA polymerase used, generally resulted in successful amplification. The families Centrarchidae and Percidae were consolidated and a suite of primers was developed to target both families simultaneously. Cyprinidae is a very large family and mitochondrial genomes were available from 538 species for sequence alignment and primer design. In this instance, a 90% consensus of the sequence alignment was used to design family-specific primers. Under the 90% criterion, primer mismatching is possible and primers may show reduced performance with certain species. Design precautions ensured that potential mismatches were minimized at the 3′ end of the primer. All cyprinid species evaluated to date have successfully produced PCR amplicons for all four mitochondrial genome fragments. Sequence alignment of available Clupeidae mitochondrial genomes lacked sufficient conservation to design robust primers at the family level. Primer design was thus restricted to those species found in North American freshwater habitats. The broader family-level applicability of the Clupeidae primer sets remains uncertain.

Amplification of complete mitochondrial genomes in two overlapping regions is possible (Zhu et al. 2013). However, amplification of long PCR templates is sensitive to DNA quality (Deagle et al. 2006). In our laboratory, recently obtained fin clips stored in 95% ethanol at room temperature showed poor amplification success after more than 3 months of storage. In contrast, fin clips stored in 95% ethanol at −80 °C provided consistent amplification after storage in excess of 2 years. Overall, we experienced better consistency in amplification success when targeting shorter fragments to obtain the complete mitochondrial genome in four overlapping regions, though amplification in two fragments is possible. Other tissue preservation methods may offer advantages over ethanol (Kilpatrick 2002) and allow better preservation of high molecular weight DNA. The need for large intact fragments of mitochondrial DNA is a methodological weakness of long-range PCR and generally precludes the analysis of archived museum specimens. In such cases, the PCR-independent approach of genome skimming offers a viable strategy to mine the wealth of voucher specimens available through museums.

Emerging techniques in the field of molecular ecology and eDNA are dependent on the continued development of representative reference databases. In particular, multi-gene metagenomics avoids potential PCR bias associated with metabarcoding biodiversity studies by directly sequencing eDNA without a preceding PCR amplification step (Bista et al. 2018; Tang et al. 2014). Environmental metagenomic strategies have proven effective for detecting insect species from eDNA water samples (Crampton-Platt et al. 2016), but are hindered by low-level recovery of mitochondrial DNA relative to non-target genomic material. Both non-PCR-mediated (Liu et al. 2016) and PCR-based methods (Deiner et al. 2017) are being used to enrich the mitochondrial DNA fraction prior to sequencing. In each instance, reference genomes are used to enhance the recovery and identification of sequencing reads obtained from mixed species assemblages and are central to the multi-gene metagenomics approach.

Continued expansion of mitochondrial genome databases to include both a greater number of species and increased representation of species from throughout their range will provide an improved basis for analysis. For example, we observed intraspecies variation across the black crappie genome of 2.65%, a comparatively high value relative to the average of 0.38% (Table 2). Further examination shows a clear geographic component: variation among samples from northern locations (New York, Pennsylvania, Lake Erie) was only 0.06% and variation among samples from a southern location (South Carolina) was 0.07%, so most of the divergence lies between the two regions rather than within either. Two additional species, round whitefish and lake sturgeon, were each obtained from a single geographic area, Alaska and New York, respectively, but still had variation in excess of 1%. This variation was attributed to a variable number of tandem repeats found in the displacement loop region. Excluding the displacement loop, both species (round whitefish, 0.17%; lake sturgeon, 0.33%) had a level of variation less than the average across all species examined. Tandem repeats within the displacement loop have been previously described and attributed to adaptation to harsh environments (Hirayama et al. 2010).

Based on the limited data presented here, it is not possible to discern the full extent of intraspecies variation, but the data do suggest that a comprehensive evaluation of the issue is warranted. Intraspecies variation is of concern when designing species-specific eDNA markers and assigning taxonomic designations to sequencing reads in metabarcoding applications. Reference datasets that lack sufficient sequence diversity can result in qPCR markers that perform poorly across a species' geographic range. The lack of sufficient sequence diversity will also negatively impact metabarcoding read classification, with sequence variants remaining unclassified due to the lack of matching sequences in the reference dataset. Continued expansion of reference datasets to include additional species and sequence diversity is an essential foundation for current and future eDNA applications. Additional sequencing with greater geographic representation will also allow future studies to explore intraspecific variation in a broader context and identify mitochondrial regions most suitable for marker development.

Use of order- or family-specific primers to easily obtain mitochondrial genome data from a large number of fish species is a valuable asset for applications such as eDNA, molecular ecology, conservation genetics, and phylogenetics. Improved species representation and geographic diversity will increase the efficiency of species-specific primer design for qPCR assays, provide more robust reference sequences for species identification in metabarcoding applications, and provide a basis for increased use of multi-gene metagenomics applications. Utilization of large mitochondrial genome databases will allow the most taxonomically discriminative marker or marker combinations to be identified, which may require targeting different regions within the mitochondrial genome. It is anticipated that continued expansion and public availability of mitochondrial genome data for fish (and all species in general) will greatly expand future applications of genomic research.