Introduction

In recent years, DNA barcoding has been widely used for species identification. Initially, it served as an auxiliary tool for identifying organisms that are difficult to differentiate on morphological characteristics alone. However, because the number of taxonomists with the skills needed for morphology-based identification is declining, the use of DNA barcoding as an identification method is expected to keep increasing (Ratnasingham and Hebert 2007, 2013). In addition to identifying species, genetic analysis of the DNA barcoding region is also highly valuable for detecting cryptic species and reassessing species diversity (Tojo et al. 2017; Okamiya et al. 2018; Saito et al. 2018; Yano et al. 2019).

For the aquatic insects of interest in this study, the standard DNA barcoding region is the mitochondrial DNA (mtDNA) COI region (658 bp). For invertebrates, including insects, the primer set of Folmer et al. (1994) and Hebert et al. (2003) is widely used and is well known as the “universal primer” set for invertebrates. In some groups, however, amplification with the universal primer set is difficult (e.g. marine invertebrates), and in such cases a modified primer set targeting the same COI region is often used (Geller et al. 2013). Because these two primer sets are highly versatile, and because both inter- and intraspecific genetic polymorphisms are easy to detect in the mtDNA COI region, they have been adopted as the first step of molecular phylogenetic analysis in many studies of invertebrates, including insects (Folmer et al. 1994; Hebert et al. 2003; Geller et al. 2013). In fact, the Barcode of Life Data System (BOLD), which is operated by an international organization, had registered more than 3.5 million COI data sets as of 26 February 2020 (URL https://biodiversitygenomics.net/projects/bold/), and users can identify a wide range of taxa using only COI sequence data (Collins and Cruickshank 2013).

In this study, the mtDNA COI region of corixid aquatic insects (Heteroptera, Corixidae) was sequenced using Folmer’s universal primer set (Folmer et al. 1994), and pseudogenes were found to be amplified as well. Problems caused by pseudogenes, especially NUMTs (nuclear sequences of mitochondrial origin, i.e. nuclear mtDNA pseudogenes; Bensasson et al. 2001), have been reported widely in eukaryotes, but this study is the first to detect them in heteropteran insects. For these aquatic bugs, we therefore highlight the risks of DNA barcoding when genetic analyses are performed with such widely used primer sets (Folmer et al. 1994; Hebert et al. 2003; Geller et al. 2013) without careful data checking. We also propose a method to avoid amplifying pseudogenes in this group.

Materials and methods

To analyze the COI barcoding region, total genomic DNA was extracted and purified from nine Hesperocorixa aquatic bugs: two specimens of Hesperocorixa kolthoffi (specimen collection sites: Arige, Wakamatsu-ku, Kitakyushu City, Fukuoka Prefecture, Japan, GenBank accession number (acc. no.) LC528377, LC528378; Fuchu, Sakaide City, Kagawa Prefecture, Japan, acc. no. LC528379, LC528380), one specimen of Hesperocorixa distanti distanti (Shippu, Atsuta-ku, Ishikari City, Hokkaido Prefecture, Japan, acc. no. LC528370, LC528371), and four specimens of Hesperocorixa distanti hokkensis (Miyakozawa Wet-land, Oyama, Tsuruoka City, Yamagata Prefecture, Japan, acc. no. LC528372, LC528373; Nanko Park, Shirakawa City, Fukushima Prefecture, Japan, acc. no. LC528374–LC528376). Using the total genomic DNA as a template, the mtDNA COI region (658 bp) was amplified by PCR with Folmer’s universal primer set (Folmer et al. 1994; Fig. 1, Table 1). The nucleotide sequences of the mtDNA COI region were then determined by direct sequencing. The methods used for total genomic DNA extraction and for purification of PCR products are described in previous studies (Suzuki et al. 2013, 2014, 2019; Takenaka and Tojo 2019; Takenaka et al. 2019).

Fig. 1 Primer locations within the whole mtDNA COI region

Table 1 PCR primers used in this study

PCR with Folmer’s universal primer set and PCR with the LCO1490 and HCOoutout primer set were carried out under the same thermal and chemical conditions as in previous studies (Saito and Tojo 2016a, b; Saito et al. 2016, 2018). Because the HCOoutout primer differs in design from Folmer’s universal primer, the resulting amplicon completely covers the DNA barcoding region; the HCOoutout primer has also been shown to be useful for analyzing various invertebrate taxa (Pickett et al. 2006; Clouse and Wheeler 2014; Sanchez and Cassis 2018). rTaq (TOYOBO, Osaka) was used as the DNA polymerase, and a 2720 Thermal Cycler (Applied Biosystems, Tokyo) was used for thermal cycling. All primers used in this study and their relative positions are shown in Fig. 1 and Table 1.

Phylogenetic analyses were performed using the maximum-likelihood (ML) method (Felsenstein 1981) in MEGA ver. 6.06 (Tamura et al. 2013) and the Bayesian method in BEAST ver. 2.5.2 (Bouckaert et al. 2019). Nodal support was assessed with 1000 bootstrap replicates (Felsenstein 1985) for the ML tree and with posterior probabilities for the Bayesian tree. The Bayesian analysis was run for 50 million Markov chain Monte Carlo (MCMC) cycles with sampling every 1000 cycles, and the initial 10% of cycles were discarded as burn-in before the consensus tree was obtained. Prior to the ML and Bayesian estimations, the program KAKUSAN4 (Tanabe 2007) was used to select the best-fit substitution model, and TN93 + G + I was chosen.
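The MCMC settings above imply a fixed number of retained samples, which the following minimal sketch works out. The function name and its parameters are hypothetical and are not part of BEAST; the sketch also assumes that the logged initial state is not counted, which may differ by one from what a real log file contains.

```python
from typing import Tuple


def retained_samples(chain_length: int, sample_every: int,
                     burn_in_percent: int) -> Tuple[int, int, int]:
    """Return (total, discarded, kept) sample counts for an MCMC run.

    Hypothetical helper for illustration only: it does plain integer
    bookkeeping and does not read or write any BEAST file.
    """
    total = chain_length // sample_every          # samples written to the log
    discarded = total * burn_in_percent // 100    # samples dropped as burn-in
    return total, discarded, total - discarded


# Settings stated in the text: 50 million cycles, sampling every 1000, 10% burn-in.
total, discarded, kept = retained_samples(50_000_000, 1000, 10)
print(total, discarded, kept)  # 50000 5000 45000
```

Under these assumptions, the consensus tree would be summarized from about 45,000 sampled trees.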

Results and discussion

Direct sequencing of the products amplified with Folmer’s universal primer set yielded two types of DNA sequence with different lengths (Fig. 2). One was of standard length (658 bp; e.g. acc. no. LC528377, H. kolthoffi), and the other was slightly shorter (652 bp; e.g. acc. no. LC528370, H. d. distanti; acc. no. LC528374, H. d. hokkensis). The shorter sequence (652 bp) was considered to be a pseudogene rather than the targeted mtDNA COI sequence. H. d. hokkensis had four haplotypes, one of which was identical to the haplotype of H. d. distanti, whereas the haplotype of H. kolthoffi was a singleton. In addition, “double peaks” were detected in two specimens (acc. no. LC528370 and LC528374) amplified with the universal primer set. At sites where double peaks were detected, the base with the larger peak was used for sequence comparison (Fig. 2).

Fig. 2 Alignment of the DNA sequences of the mtDNA COI region of Hesperocorixa aquatic bugs and associated pseudogenes. Numbers in the sequence names are unique specimen numbers. Three sequences with black backgrounds were amplified with Folmer’s universal primer set, and three sequences with gray backgrounds were amplified with the LCO1490 and HCOoutout primer set. Sites that match across all sequences are indicated with asterisks. The sequences LC528370 and LC528371 were obtained from the same sample with the universal primer set (LC528370) and with the LCO1490 and HCOoutout primer set (LC528371). The same applies to the relationship between LC528374-75 and LC528377-78.

To quantify the similarity between these sequences of different lengths, we counted matching and mismatching sites; the similarity was relatively high (ca. 85%). However, the true COI sequences and the putative pseudogene sequences did not form a monophyletic group. Both a stop codon and an indel were detected within the short sequences (Fig. 3). The phylogenetic analysis of these data sets is shown in Fig. 4. The short sequences of H. d. distanti and H. d. hokkensis (acc. no. LC528370, LC528374) obtained with Folmer’s universal primer set were positioned outside the clade comprising the same species and closely related species of aquatic heteropterans.
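The match/mismatch count described above can be illustrated with a minimal sketch. It assumes that the two sequences have already been aligned to the same length and that columns containing a gap are skipped; the text does not state how gap sites were treated, so that choice is an assumption. The function name and the toy sequences are hypothetical and are not the study data.

```python
def percent_identity(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Percent of matching sites between two aligned sequences.

    Assumption for illustration: alignment columns with a gap in either
    sequence are excluded from both the match count and the total.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy aligned sequences: 8 comparable columns, 7 of which match.
print(percent_identity("ATGCATGCAT", "ATGC--GCTT"))  # 87.5
```

Counting gap columns as mismatches instead would lower the value, so the gap convention should be reported alongside any similarity figure.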

Fig. 3 Alignment of the protein sequences translated from the mtDNA COI sequences. Numbers in the sequence names are unique specimen numbers. Three sequences with black backgrounds were amplified with Folmer’s universal primer set, and three sequences with gray backgrounds were amplified with the LCO1490 and HCOoutout primer set. Sites that match across all sequences are indicated with asterisks. The sequences LC528370 and LC528371 were obtained from the same sample with the universal primer set (LC528370) and with the LCO1490 and HCOoutout primer set (LC528371). The same applies to the relationship between LC528374-75 and LC528377-78.

Fig. 4 Phylogenetic tree reconstructed from the COI gene (658 bp) and the pseudogenes (652 bp) of Corixidae. Maximum-likelihood bootstrap values and Bayesian posterior probabilities are indicated at nodes. Sequence names consist of the GenBank accession number plus the taxon name. The primer sets used to amplify the sequences are indicated with the suffixes -u (Folmer’s universal) and -m (modified). Corixid pictures were taken by T. Mitamura, one of the authors (H. d. hokkensis), and by Mr. Kei Hirasawa (H. d. distanti and Hesperocorixa kolthoffi). The sequences LC528370 and LC528371 were obtained from the same sample with the universal primer set (LC528370) and with the LCO1490 and HCOoutout primer set (LC528371). The same applies to the relationship between LC528374-75 and LC528377-78.

The PCR product amplified with the LCO1490 and HCOoutout primer set included the same-length sequence (658 bp) as the ordinary COI region. These results revealed that in this species, at least two regions were amplified by Folmer’s universal primer set. Since one of these was not detected by long PCR analysis, these results suggest that there may be a pseudogene outside the mitochondrial genome, or at the very least, outside the original COI region.

The existence of such pseudogenes of the mtDNA COI region is widely known in some insect orders [e.g., Diptera, Hymenoptera and Lepidoptera (Hazkani-Covo et al. 2010); Orthoptera (e.g. Song et al. 2014); Psocoptera (Chen et al. 2014); Coleoptera (King et al. 2015); Hemiptera s. str. (Tay et al. 2017)]. However, there has been no such report for heteropterans (Heteroptera s. str.), so this is the first record for the group. These findings highlight the need to consider the possible presence of pseudogenes when using DNA barcoding to identify species in this group. We suggest that using the LCO1490 and HCOoutout primer set for DNA barcoding of heteropterans is an effective way to avoid contamination by pseudogenes.
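As an illustration of the kind of data checking recommended here, the sketch below screens a barcode sequence for two simple warning signs: a length that differs from the expected 658 bp, and stop codons in every forward reading frame. It is a heuristic under stated assumptions, not the procedure used in this study. The function names and toy sequences are hypothetical, and the stop codons (TAA and TAG) are those of the invertebrate mitochondrial genetic code (NCBI translation table 5). Passing this screen does not show that a sequence is a genuine COI sequence.

```python
# Stop codons of the invertebrate mitochondrial code (NCBI translation table 5).
STOP_CODONS = {"TAA", "TAG"}


def stops_in_frame(seq: str, frame: int) -> list:
    """Return start positions (0-based) of stop codons in one forward frame."""
    seq = seq.upper().replace("-", "")
    return [i for i in range(frame, len(seq) - 2, 3)
            if seq[i:i + 3] in STOP_CODONS]


def screen_barcode(seq: str, expected_length: int = 658) -> dict:
    """Flag simple pseudogene warning signs in a putative COI barcode.

    Hypothetical helper: it checks only sequence length and in-frame stop
    codons, scanning all three forward frames because the correct frame
    is not assumed to be known.
    """
    ungapped = seq.upper().replace("-", "")
    stops = {frame: stops_in_frame(ungapped, frame) for frame in range(3)}
    return {
        "length": len(ungapped),
        "length_differs": len(ungapped) != expected_length,
        "open_frame_exists": any(not hits for hits in stops.values()),
        "stops_by_frame": stops,
    }


# Toy sequences, not study data: the second has a stop codon in every frame.
print(screen_barcode("ATGGCATGAGGA", expected_length=12)["open_frame_exists"])
print(screen_barcode("TAACTAGCTAA", expected_length=12)["stops_by_frame"])
```

A sequence flagged by either check would merit closer inspection, for example by re-amplifying the sample with a different primer set and comparing the results.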

In addition, DNA-based identification has recently become popular in other types of studies. For example, an increasing number of studies have reported rapid surveys of the biological community structure of a habitat by means of metabarcoding and/or metagenomic analysis. More recently, environmental DNA (eDNA) analyses, which can survey a taxonomic group of organisms inhabiting ponds, lakes, and rivers merely by sampling water from them, have come to be carried out frequently (Minamoto et al. 2012; Bista et al. 2017; Doi et al. 2017). To make eDNA analyses more reliable and comprehensive, it is very important to accumulate error-free DNA sequence information for various genetic regions of the target taxa (e.g. the mtDNA COI, 16S rRNA, and cyt b regions, and the nuclear histone H3 region).