Introduction

A fundamental challenge for conservation biology is to accurately identify units of biodiversity and to diagnose those units that merit conservation concern (Vane-Wright et al. 1991; Moritz 1994, 2002; Vogler and DeSalle 1994; Bowen 1998, 1999; Haig 1998; Grady and Quattro 1999; Paetkau 1999; Crandall et al. 2000; Goldstein et al. 2000; Fraser and Bernatchez 2001; Agapow et al. 2004). Identification or circumscription of taxa consists of two different, yet equally important, aspects. First, identification can refer to the accurate demarcation of a taxon previously described, such as a species or population with a known conservation status. Second, identification can include the discovery of a new taxon and determination of whether it merits conservation attention. Traditionally, identification has been the responsibility of the taxonomist, who would often combine morphological, behavioral, life history, ecological, and genetic data to determine taxon identity. However, the number of taxonomic specialists, those with an intimate knowledge of a group of organisms, is declining. Funding for alpha taxonomy is increasingly scarce and even museums, the hearts of taxonomy, are facing financial and philosophical challenges (Noss 1996; Wheeler 2004). Conversely, the field of molecular taxonomy (or molecular systematics) is growing in popularity.

Molecular taxonomy uses genetic variation, usually from one or more genes, to construct a phylogeny with the implicit assumption that the gene genealogies reflect the phylogeny of the species sampled (Pamilo and Nei 1988; Harrison 1989; Brower and DeSalle 1994; Brower et al. 1996; Degnan and Rosenberg 2006). Molecular taxonomy has the advantage that it is relatively easily applied, does not require the same expertise as traditional taxonomy, and, importantly, provides an evolutionary framework for the taxa in question (Tautz et al. 2002, 2003). The pinnacle of enthusiasm for molecular approaches to taxonomy is DNA barcoding, which proposes the use of a 648 base pair (bp) fragment of the mitochondrial cytochrome c oxidase subunit one (COI) gene to identify and delineate species (Hebert et al. 2003a, b). This approach has some advantages, including the ability to obtain taxonomic information from portions or fragments of organisms, its applicability at any life stage (e.g. Sperling et al. 1994; Greenstone et al. 2005; Miller et al. 2005), and the speed with which data can be obtained. The concept of using DNA to identify an organism is not new, and routinely employs a BLAST (Basic Local Alignment Search Tool) search on NCBI’s (National Center for Biotechnology Information) GenBank or similar databases (Tautz et al. 2003; Hebert et al. 2004b). DNA barcoding is simply an attempt to standardize which piece of DNA is used.

The arguments put forth by proponents of DNA barcoding strongly suggest that molecular approaches and a common database will be increasingly useful for identification. In essence, the molecular database will act as a “molecular field guide,” facilitating identifications. They also argue that discovery of cryptic species will be enhanced with this approach. Discovery of cryptic variation has long been the purview of molecular phylogenetics and the utility of molecular approaches for this aspect of identification is undeniable (Donnellan and Aplin 1989; Good 1989). In fact, some molecular phylogeneticists have criticized the DNA barcoding agenda as being a misapplication and simplification of molecular techniques (Sperling 2003a; Rubinoff 2006). Our concern here is not with all of molecular phylogenetics, but with the utility of molecular taxonomic approaches for demarcation, especially when they utilize single-locus genetic data (as emphasized by proponents of DNA barcoding) and are applied to taxa of conservation interest. What might the consequences be of using DNA barcodes, or any single locus sequence, in conservation?

Applying molecular taxonomy to conservation certainly has advantages for screening for illegal trafficking of endangered species (Baker et al. 1996; DeSalle and Birstein 1996; Palumbi and Cipriano 1998; Ludwig 2006) and revealing cryptic species or lineages previously undescribed (e.g. Brower 1996; Omland et al. 2000; Witt et al. 2006; but see Irwin 2002) (for other applications of phylogenetics in conservation see Purvis et al. 2005). However, the utility of phylogenetic approaches for identifying closely related taxa (particularly those of recent origin) or for diagnosing populations or lineages that merit protection is unresolved (Sperling 2003a). The term evolutionarily significant unit (ESU) has been used to refer to populations or lineages that represent unique, significant adaptive variants within species (Avise 1989; Moritz 1994, 2002; Vogler and DeSalle 1994; Bowen 1998, 1999; Grady and Quattro 1999; Paetkau 1999; Crandall et al. 2000; Goldstein et al. 2000; Fraser and Bernatchez 2001; Agapow et al. 2004). Increasingly, molecular taxonomy and phylogenetic analyses have been used to identify ESUs, based upon the assumption that historically distinct populations have the greatest potential to contain distinct adaptive variation. However, earlier incarnations of the ESU included the use of other information in addition to molecular data: “Identification of ESUs within a species was recognized as a difficult task, requiring the use of natural history information, morphometrics, range and distribution data, as well as protein electrophoresis, cytogenetic analysis, and restriction mapping of nuclear and mitochondrial DNA” (Ryder 1986). Designating a population as an ESU based solely on sequence variation at a single marker locus ignores the fact that adaptive evolution and population differentiation can proceed at a rapid pace compared to rates of molecular evolution at a presumably neutral marker. Thus, historical isolation would trump morphological, life history, and ecological discontinuities discordant with molecular data. Despite the claim that molecular phylogenetic approaches will provide a more accurate view of biodiversity (Hebert et al. 2003a, b; but see Brower 2006), a number of processes occurring at or below the species level, including recent differentiation and hybridization, will result in an underestimation of biodiversity.

Our principal objectives are to provide researchers and conservation managers with a discussion of patterns and processes at or near the species level that can create problems associated with molecular taxonomy based on single-locus genetic data. The relevant literature from genetic studies of the Lepidoptera is examined, which provides a context in which future conservation genetics work may be evaluated. We urge that employment of molecular genetics data for identification of units for conservation be undertaken with healthy skepticism and an informed opinion about both the utility and the limitations of these techniques.

We focus here on the Lepidoptera for several reasons. The order has a long history of study and is well described compared to other invertebrate taxa. The ecologies and life histories of the Lepidoptera are relatively well known; this is especially true of the butterflies, which enables us to carefully examine the efficacy of molecular taxonomy for this group. This group is also charismatic and important in conservation. Disproportionately more Lepidoptera appear on the endangered species list than any other group of insects and the Lepidoptera have been used as bioindicator species for identification of diversity hotspots or as indicators of ecosystem health (Kremen 1992; Gaston and David 1994; Beccaloni and Gaston 1995; Cremene et al. 2005; Werner and Buszko 2005). Furthermore, several recent examples of the DNA barcoding approach involve investigations of Lepidoptera (e.g. Hebert et al. 2004a; Hajibabaei et al. 2006). Lepidoptera also serve as an important test case for the use of mitochondrial DNA (mtDNA) in species identification because female Lepidoptera are heterogametic, and Haldane’s rule predicts reduced viability of the heterogametic sex in hybrids (which should limit introgression of a maternally inherited genetic element; discussed further below) (Sperling 2003b).

Survey of the literature on the Lepidoptera

We searched the genetic literature involving Lepidoptera for any study in which genetic data were found to be in conflict in some way with nominal taxonomic designations. Specifically, we looked for examples in which researchers set out to study taxa (species, subspecies or “races”) believed to be distinct based on morphological, ecological, or behavioral characteristics, but concluded that mtDNA sequences alone could not be used to define or identify the taxa in question. Studies which simply surveyed genetic variation within ecologically or morphologically polymorphic species without any particular expectation of reproductive isolation or genetic divergence among taxa were not considered.

The following key words generated 437 studies in ISI Web of Science: “Lepidoptera* OR butterfl* OR moth* NOT mother*” combined with “mtdna OR mitochondrial.” Of those 437 studies, 147 examined mtDNA sequences from two or more Lepidopteran taxa; these studies ranged from phylogeographic studies of mtDNA variation among races and subspecies to systematic studies including only one or two specimens per taxon. For the sake of brevity in the results reported here, we focus on studies which used direct sequencing, as opposed to restriction-site analysis (in a very small number of cases, we have included studies using restriction-site analysis if they also included extensive sequencing to verify the identity of haplotypes). We avoided duplication in that list of 147 by removing a small number of studies in which similar conclusions were reached with the same taxa using different data sets.

We found 31 studies in which genotypic information was perceived to be in conflict with nominal taxonomic boundaries (Table 1). That is 21% of the total number (147) of studies using mtDNA sequence data. It should be noted that many of the 147 studies focused on higher-level systematics, and had little to no chance of discovering taxonomic-molecular discrepancies among closely related taxa. Again for the sake of brevity in the results reported here, we did not report cases in which one or two individuals (out of many dozens or hundreds sequenced from individual taxa) were found to be a mismatch between taxonomic designation and mtDNA genotype (e.g. Dasmahapatra et al. 2002; Ounap and Viidalepp 2005; Kronforst et al. 2006; Mullen 2006).

Table 1 A summary of published studies in which patterns from the analysis of mitochondrial DNA (mtDNA) sequences were not consistent with nominal taxonomic boundaries. Designations of rank or status used below (i.e. races, subspecies, or species) are those used by the authors of each study; “species complex” generally refers to a group of taxa of controversial taxonomic status containing at least some members that are distinct and possibly genetically isolated. Only the source of mtDNA data is reported below (although nuclear DNA was studied in many cases)

Nine of the 31 studies documented in Table 1 involve taxa that are either threatened, endangered, or rare. In at least 20 of the 31 studies, the discrepancy between genetic data and taxonomy was caused by the sharing of mtDNA haplotypes among focal taxa. In other words, individuals assigned a priori to different taxa possessed identical sequences of mtDNA. The population genetic processes underlying this and other sources of genetic-taxonomic conflict are discussed in detail below.

The percentage of studies reported above should not be taken as an estimate of the proportion of sister taxa in the Lepidoptera that can or cannot be identified with single-locus genetic data. There is an inherent bias, particularly within taxonomically controversial groups, which prohibits such a tally: a researcher studying a suite of closely related taxa may decide which category of names to use (species vs. subspecies, for example) only after the genetic data have been analyzed. Rather than put a number to the potential error rate in molecular taxonomy, our review of the literature highlights cases in which reliance on single-locus data for taxonomic identification may present serious problems. In most of the cases in our survey of the Lepidopteran literature, the authors had some experience with the natural history of the studied organisms. Consequently, failures of molecular taxonomy were readily identified. As molecular taxonomy is applied to less well known taxa, it may become nearly impossible to detect such shortcomings.

Hajibabaei et al. (2006) examined the efficacy of DNA barcoding in three Lepidopteran families and concluded that 97.9% of the 521 species were accurately distinguished. This figure is truly impressive; however, it is not clear how these results apply to assessing the utility of DNA barcoding for conservation at the species or subspecies levels. Hajibabaei et al. (2006) used relatively few individuals per species and all specimens were collected from a single locality (Area de Conservación Guanacaste, Costa Rica). Consequently, no information about geographical genetic variation within species was assayed. This feature of their data introduces a bias that prevents a clear assessment of the utility of barcoding for conservation (Sperling 2003a). This is because sampling from a single locality provides little information on the efficacy of DNA barcoding to distinguish units for conservation from closely related lineages which may often be allopatric (e.g. Hall and Harvey 2002). Furthermore, data from a single locality provide no insight into geographical phenomena such as barriers between taxa, clinal variation, or isolation by distance. A more comprehensive test of DNA barcoding was done by Meyer and Paulay (2005) using marine gastropods. They surveyed genetic variation across the ranges of gastropod species and found considerable error rates (of species identification), as high as 34% for some groups (Meyer and Paulay 2005), because intraspecific variation overlapped with interspecific variation (see also Funk and Omland 2003).
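The overlap between intraspecific and interspecific variation described above can be illustrated with a minimal sketch. It assumes aligned sequences of equal length and uses uncorrected p-distances; the function names, species labels, and the toy 10-site alignment are hypothetical and are not drawn from any of the cited studies.

```python
from itertools import combinations


def p_distance(a, b):
    """Uncorrected proportion of differing sites between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def barcoding_gap(samples):
    """samples: list of (species_label, aligned_sequence) pairs.

    Returns (largest intraspecific distance, smallest interspecific distance).
    """
    intra, inter = [], []
    for (sp1, s1), (sp2, s2) in combinations(samples, 2):
        (intra if sp1 == sp2 else inter).append(p_distance(s1, s2))
    return max(intra), min(inter)


# Hypothetical toy alignment (10 sites); not real data.
samples = [
    ("sp_A", "AAAAAAAAAA"),
    ("sp_A", "AAAAAAAATT"),  # divergent conspecific, e.g. from a distant locality
    ("sp_B", "AAAAAAAAAT"),
    ("sp_B", "AAAAAAACAT"),
]

max_intra, min_inter = barcoding_gap(samples)
# When the largest within-species distance reaches the smallest
# between-species distance, a distance threshold cannot separate the taxa.
overlap = max_intra >= min_inter
```

In this toy case the two conspecific sequences of sp_A differ more from each other than either does from the nearest sp_B sequence, so no single distance threshold would assign all four sequences correctly. Sampling only one locality per species would tend to hide this kind of within-species variation.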

Patterns and processes

There are several important mechanisms that may cause single-locus molecular data to misdiagnose the true evolutionary relationships among populations, subspecies and species. Here we briefly outline these mechanisms and describe several relevant methodological approaches.

Incomplete lineage sorting and the problem of recency

Perhaps one of the most important causes of taxonomic misdiagnosis by single-locus genetic data arises from the simple fact that selection can change trait values more rapidly than the neutral processes of drift and mutation can create lineage divergence. Natural selection can be extremely effective over just a few generations (Lande and Arnold 1983; Schluter and Smith 1986; Grant and Grant 1993; Reznick and Ghalambor 2001) and can play a direct and important role in speciation (Funk 1998; Nagel and Schluter 1998; Orr and Smith 1998; Jiggins et al. 2001; Ramsey et al. 2003; Forister 2005; Rundle and Nosil 2005). Estimates of the strength of selection across multiple studies also demonstrate that in certain situations selection associated with mate recognition and mate choice (sexual selection) may be especially effective (Hoekstra et al. 2001; Kingsolver et al. 2001; Naisbit et al. 2001).

Neutral processes, on the other hand, operate more slowly. For Lepidoptera, estimates of rates of sequence divergence range from 1.7% divergence per lineage per million years based on the general arthropod mitochondrial mutation rate (Brower 1994) to slower estimates of 0.39–0.51% per lineage per million years estimated in the swallowtail butterfly genus Papilio (Zakharov et al. 2004). Adaptive, quantitative traits (such as morphological or behavioral traits) are likely to evolve at rates considerably faster than genes evolving by neutral processes. To put these rates in perspective for the 648 bp fragment of COI recommended for DNA barcoding, we can expect one nucleotide change to occur every 130,000–390,000 years on average (this is of course only a very rough estimate and it should be noted that mutations occur stochastically rather than regularly as this kind of calculation misleadingly implies). The rate disparity between the evolution of adaptive traits and presumably neutral genes becomes particularly important for molecular taxonomy in adaptive radiations. Here, selection operating on morphological, ecological and/or behavioral traits can far outpace neutral changes at single loci such as mtDNA markers. (Neutrality of mtDNA sequence variation is assumed throughout this paper; for a discussion of potential non-neutrality of mtDNA variation under certain conditions and the consequences for population and phylogenetic analyses of such non-neutrality see Ballard and Whitlock (2004) and Hurst and Jiggins (2005)).
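The rough waiting-time arithmetic in the preceding paragraph can be sketched as follows. This is a back-of-the-envelope calculation only: it assumes a constant per-lineage rate expressed as percent divergence per million years, applied uniformly across all sites of the fragment, and the function name is hypothetical.

```python
def years_per_change(fragment_bp, rate_pct_per_lineage_per_myr):
    """Mean waiting time (in years) between substitutions anywhere in a
    fragment along a single lineage, assuming a constant, uniform rate."""
    changes_per_myr = fragment_bp * rate_pct_per_lineage_per_myr / 100.0
    return 1_000_000 / changes_per_myr


BARCODE_BP = 648  # length of the COI fragment recommended for DNA barcoding

# Per-lineage rates (percent per million years) mentioned in the text.
for rate in (1.7, 0.51, 0.39):
    print(f"{rate}% per Myr: one change every "
          f"{years_per_change(BARCODE_BP, rate):,.0f} years on average")
```

Under these assumptions the slower Papilio rates give roughly 300,000 to 400,000 years per change, in line with the upper end of the range quoted above. Faster rates shorten the interval: about 130,000 years at roughly 1.2% per lineage per million years, and about 91,000 years at 1.7%. As noted in the text, substitutions arrive stochastically, so these are averages rather than regular intervals.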

A consequence of the different rates of evolution associated with adaptive traits and neutral genetic variation is the pattern of incomplete lineage sorting, in which some alleles within one taxon may be most closely related to alleles in another taxon. Put another way, for cases of rapid divergence, there is some time in which the genetic variation that existed in the ancestral species (ancestral polymorphism) is not sorted into monophyletic gene trees (Pamilo and Nei 1988; Doyle 1992; Maddison 1997; Paetkau 1999) (Fig. 1, based on Avise (1994)). Variation in each descendant lineage is thus a sample of the ancestral polymorphism. Gene trees in new species can therefore be polyphyletic or unresolved for some time after a speciation event (Figs. 1 and 2) (Neigel and Avise 1986; Brower et al. 1996). This situation might continue until reproductive isolation, mutation and drift eventually create reciprocal monophyly and complete lineage sorting between the sister taxa, at which point the gene tree might be congruent with the species tree (Figs. 1 and 2) (Maddison 1997). Assuming neutrality, Neigel and Avise (1986) and Takahata (1989) demonstrated that this period of incongruence between gene trees and species trees for mtDNA will last on average for a time (in generations) roughly equal to four times the effective population size (4Ne generations). Consequently, genealogical relationships among mtDNA haplotypes, even if correctly reconstructed, are unlikely to reflect the true phylogenetic relationships of the two lineages for a considerable period of time until mutation and drift eventually “sort the lineages” and the two descendant taxa become reciprocally monophyletic (Pamilo and Nei 1988; Takahata 1989; Maddison 1997).
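The dependence of sorting time on population size can be explored with a small forward simulation. The sketch below is a simplified illustration and not the analysis of Neigel and Avise (1986) or Takahata (1989): it assumes two completely isolated daughter populations of constant size, haploid Wright-Fisher reproduction, and no mutation, selection, or gene flow. Each population is treated as sorted once all of its gene copies descend from a single copy present at the split, which is a sufficient (not a necessary) condition for monophyly. All names and parameter values are hypothetical.

```python
import random


def generations_to_single_founder(n_copies, rng):
    """Haploid Wright-Fisher drift: number of generations until every gene
    copy in an isolated population traces back to one copy present at the split."""
    founders = list(range(n_copies))  # each copy starts as its own lineage
    generations = 0
    while len(set(founders)) > 1:
        # Every copy in the next generation picks a parent at random.
        founders = [rng.choice(founders) for _ in range(n_copies)]
        generations += 1
    return generations


def generations_to_reciprocal_monophyly(n_copies, rng):
    """Both daughter populations sort independently; the slower one
    determines when the pair becomes reciprocally monophyletic."""
    return max(generations_to_single_founder(n_copies, rng) for _ in range(2))


rng = random.Random(1)  # fixed seed so the sketch is repeatable
n_copies = 30           # transmitted mtDNA copies per population (assumed)
runs = [generations_to_reciprocal_monophyly(n_copies, rng) for _ in range(100)]
mean_generations = sum(runs) / len(runs)
print(f"mean generations to reciprocal monophyly: {mean_generations:.1f}")
```

Re-running the sketch with a larger `n_copies` should lengthen the mean waiting time roughly in proportion, which is the qualitative point of the text: large populations retain ancestral polymorphism, and therefore non-monophyletic gene trees, for a long time after divergence.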

Fig. 1 Gene genealogies of diverging taxa sampled over time. Circles indicate genes in each generation (horizontal row). Thick lines indicate the species tree. Thin solid lines indicate the genealogy of the sampled genes labeled with letters. Dotted lines indicate genealogical relationships of unsampled genes. (a) Shortly after speciation, the gene tree exhibits polyphyly with respect to the two species. (b) Later, the gene tree exhibits paraphyly with respect to the two species. (c) Even later, the gene tree exhibits monophyly with respect to the two species. (d) Reticulate genealogy caused by gene flow (introgressive hybridization). The gene tree exhibits paraphyly with respect to the two species

Fig. 2 A schematic representation of genealogical evolution. The geographical distribution of haplotypes (sequence alleles) of a hypothetical speciation event and corresponding gene trees and networks of sampled genes over time. (a) An ancestral population (at the bottom of the figure) containing two haplotypes (a and b) undergoes allopatric divergence. Over time, new haplotypes (i.e. c–f) evolve by mutation, some become extinct, and allele frequencies change by genetic drift. (b) Gene trees of the haplotypes change over time. Shortly after speciation, the gene tree is unresolved, exhibiting a basal polytomy. Later, new haplotypes evolve (i.e. c–f) although the ancestral haplotypes (a and b) are still extant. Even later, ancestral haplotypes become extinct and the last gene tree (top) exhibits monophyly with respect to the two species. Note that the ancestor-descendant relationships between haplotypes are obscured. (c) Haplotype networks representing the genealogical relationships among haplotypes over time. Circles with letters indicate haplotypes. A closed circle represents a missing but inferred haplotype that is presumed to be extinct (i.e. haplotype a in this example).

The studies reported in Table 1 provide numerous illustrations of the early stages of the diversification and lineage sorting process as depicted in Figs. 1 and 2. For example, the butterfly Acrodipsas cuprea in south-eastern Australia is composed of four geographic races characterized by differences in male wing color pattern that apparently evolved recently and in allopatry (study 5 in Table 1). Six mitochondrial haplotypes were found throughout the four races: three of them were “private” haplotypes (found in only one taxon), while the other three were shared among taxa (compare to Fig. 2A immediately after the gene pools are divided). Similarly, a mix of private and common haplotypes were found among species in the Mitoura complex (study 19), and among morphologically and phenologically divergent host races of Prodoxus quinquepunctellus yucca moths (study 11). Further along in the process of differentiation, shared haplotypes may no longer exist, but sequences sampled from different taxa do not yet form reciprocally monophyletic lineages. This has caused many authors to infer that focal taxa should not be considered genetically distinct taxa or species (e.g. studies 16, 21, 29). This problem is not unique to the Lepidoptera. Funk and Omland (2003) found incomplete lineage sorting, introgression and other processes resulting in non-monophyletic relationships in 23% of the 2319 metazoan species they surveyed.
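The distinction between “private” and shared haplotypes used above can be made concrete with a short sketch. The function name, race labels, and haplotype labels are hypothetical; the records are invented for illustration and do not reproduce the data of any study in Table 1.

```python
from collections import defaultdict


def classify_haplotypes(records):
    """records: iterable of (taxon, haplotype) pairs, one per sequenced individual.

    Returns two dicts: private haplotypes mapped to their single taxon, and
    shared haplotypes mapped to the sorted list of taxa in which they occur.
    """
    taxa_by_haplotype = defaultdict(set)
    for taxon, haplotype in records:
        taxa_by_haplotype[haplotype].add(taxon)
    private = {h: next(iter(t)) for h, t in taxa_by_haplotype.items() if len(t) == 1}
    shared = {h: sorted(t) for h, t in taxa_by_haplotype.items() if len(t) > 1}
    return private, shared


# Hypothetical individuals from three geographic races; not real data.
records = [
    ("race_1", "h1"), ("race_1", "h2"),
    ("race_2", "h2"), ("race_2", "h3"),
    ("race_3", "h3"), ("race_3", "h4"),
]

private, shared = classify_haplotypes(records)
print("private:", private)
print("shared:", shared)
```

Any individual carrying a shared haplotype (h2 or h3 in this toy example) cannot be assigned to a taxon from the sequence alone, regardless of how distinct the taxa are in morphology or ecology.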

While the idea of incomplete lineage sorting conflicts with common intuitions about the utility of phylogenetics for delineating species, it is worth noting that mitochondrial genes (and chloroplast genes in plants) represent a tiny fraction of the total genome. Natural selection operates on specific, ecologically relevant gene loci. The tiny genome fraction that consists of mitochondrial genes usually has nothing to do with adaptive differentiation driven by selection. In a very real sense, the genome can be considered a mosaic of genes, some of which undergo rapid allele frequency changes due to selection; the rest (the majority), including mitochondrial genes, experience the effects of genetic drift at a relatively slower pace.

The phenomenon of incomplete lineage sorting should not itself be considered erroneous. While it can create difficulties for diagnosing taxonomic boundaries, lineage sorting represents the natural genealogical processes that occur during divergence and speciation. A solution is to examine genealogies across multiple loci (Wu 1991; Doyle 1992; Moore 1995, 1997; Maddison 1997; Hoelzer 1997). Complete consensus across multiple gene trees can rule out incomplete lineage sorting. Likewise, conflict between gene trees can be used to detect the presence of ancestral polymorphism. A phylogeny that accurately reflects evolutionary relationships among taxa may be estimated from the combination of data from multiple loci. As Maddison (1997) eruditely suggested, the species tree “can be visualized like a fuzzy statistical distribution, a cloud of gene histories.”
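As a minimal sketch of comparing genealogies across loci, the following code checks whether two taxa are reciprocally monophyletic on each of several rooted gene trees and flags discordance among loci. Trees are represented as nested tuples of tip labels, which is an assumption made for brevity; the tip names, locus names, and topologies are hypothetical.

```python
def leaves(tree):
    """Set of tip labels under a node; a tip is a plain string."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def clades(tree):
    """Yield the tip set of every clade (subtree) in a rooted tree."""
    yield leaves(tree)
    if not isinstance(tree, str):
        for child in tree:
            yield from clades(child)


def is_monophyletic(tree, taxon_tips):
    """True if some clade contains exactly the tips of the taxon and no others."""
    target = set(taxon_tips)
    return any(clade == target for clade in clades(tree))


def reciprocally_monophyletic(tree, tips_a, tips_b):
    return is_monophyletic(tree, tips_a) and is_monophyletic(tree, tips_b)


# Hypothetical sampled individuals of two taxa, A and B.
tips_a = {"A1", "A2"}
tips_b = {"B1", "B2"}

# Hypothetical rooted gene trees for two loci.
gene_trees = {
    "mtDNA": (("A1", "B1"), ("A2", "B2")),      # haplotypes interdigitate
    "nuclear_1": (("A1", "A2"), ("B1", "B2")),  # taxa form separate clades
}

results = {
    locus: reciprocally_monophyletic(tree, tips_a, tips_b)
    for locus, tree in gene_trees.items()
}
discordant = len(set(results.values())) > 1
print(results, "discordant across loci:", discordant)
```

Discordance of this kind does not by itself distinguish retained ancestral polymorphism from introgression, and agreement among a few small samples is weak evidence of complete sorting; the check is only a first step toward the multilocus comparisons advocated in the text.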

Hybridization

The second important cause of misdiagnoses of taxa is hybridization (i.e. interbreeding between distinct populations). Introgressive hybridization occurs when genes from one lineage “invade” the gene pool of another lineage through hybrid matings (Fig. 1D) (Funk and Omland 2003; Mallet 2005). Hybridization, followed by backcrossing, can lead to the establishment of the introgressing allele(s) at high frequency in the new gene pool by chance (genetic drift) or by selection if the introgressing allele is positively selected (selective sweep).

The likelihood of this gene flow occurring is reduced if the introgressing gene is linked to loci under selection or is itself disfavored by selection (Barton 1985). Thus presumed neutral loci unlinked to nuclear genes, such as mtDNA and chloroplast DNA (cpDNA), are the most likely candidates for introgression. However, while cytoplasmic DNA (mtDNA and cpDNA) may move readily across species boundaries, nuclear markers may also show introgression (e.g. Strieff et al. 2005). Chan and Levin (2005) considered introgression in the context of models of frequency-dependent assortative mating. For reasonable parameter values (frequency of hybridization, strength of selection, etc.), they found that extensive introgression can occur quite readily, even without prolonged contact between the species. Sympatry was not required for introgression. Occasional immigrants, including rare, long-distance dispersers, can lead to introgression.

In a phylogenetic context, introgressed genes clearly do not represent the true evolutionary relationships among the taxa involved (Fig. 1) and therefore present a serious problem for accurate diagnoses of units for conservation. Gompert et al. (2006; study 8 in Table 1) recently described this phenomenon in the case of the endangered Karner blue butterfly (Lycaeides melissa samuelis) in which mtDNA sequences are shared between this endangered species and its close relative the Melissa blue butterfly, L. m. melissa (although mtDNA haplotypes are introgressed, nuclear loci are diagnostic for the two taxa). Other examples in which introgression has been implicated include studies 12, 16, and 21 in Table 1.

For the Lepidoptera, mitochondrial introgression should be less frequent because the females are the heterogametic sex (Sperling 1993, 2003b). Haldane’s rule predicts that the heterogametic sex should exhibit reduced viability as hybrids compared to the homogametic sex (Haldane 1922; Turelli and Orr 2000). Since mtDNA is maternally inherited, Haldane’s rule should restrict introgression in the Lepidoptera (Sperling 1993). Despite this, several clear cases of mitochondrial introgression in the Lepidoptera have been reported (examples from Table 1, cited above; also, Jiggins et al. 1997), and introgression may be even more prevalent in animals with male heterogamety.

Detecting introgression can be difficult but is most easily recognized by discordant patterns of variation across multiple loci. mtDNA introgression is most easily detected by comparison to data from multiple nuclear markers (e.g. Gompert et al. 2006). It is clear that data from multiple markers (i.e. several loci) with varying modes of inheritance offer the best possibility for not only detecting discordant patterns arising from incomplete lineage sorting and/or introgressive hybridization, but also for accurately capturing the evolutionary history of closely related taxa (Maddison 1997; Chan and Levin 2005; Gompert et al. 2006).

Taxonomic rank

Perhaps the mechanisms discussed above are irrelevant to molecular taxonomy, and DNA barcoding in particular, because data from mtDNA cannot be expected to diagnose taxa below the species level (for precisely the reasons discussed above, such as lineage sorting). Thus any conflict between genetic data and taxonomy can be dismissed as a misapplication of technique. Such an argument is unsatisfactory for a number of reasons. Despite the fact that species are generally regarded as the fundamental unit of biodiversity, there is no general agreement on what a species is (for a review of species concepts see Harrison 1998; Templeton 1989; Hey 2001; Coyne and Orr 2004). This problem extends to subspecies, and for molecular taxonomy the designation of a species or subspecies rank does not necessarily predict sequence divergence (Cognato 2006). For example, the percent sequence divergence for COI between the Melissa blue, L. m. melissa, and the endangered Karner blue, L. m. samuelis, averages 2.0% (not including introgressed haplotypes; data from Nice and Shapiro 1999 and Gompert et al. 2006). Despite the presence of shared haplotypes, average molecular divergence between the distinct haplotypes in the two subspecies exceeds, or is comparable to, the divergence observed at the same locus between many recognized species pairs. For example, Papilio canadensis and P. rutulus (Papilionidae) are 2.0% divergent (Caterino and Sperling 1999), Ithomia salapia and I. iphianassa (Nymphalidae) are 0.3% divergent (Mallarino et al. 2005), and Chrysoritis palmus and C. nigricans (Lycaenidae) are 1.3% divergent (Rand et al. 2000). Furthermore, matings between some butterfly species pairs that exceed 7% divergence at COI can produce viable hybrid offspring (for review of Lepidoptera hybrid viability, see Presgraves (2002)).

Some authors have suggested that the Karner Blue deserves full species status (e.g. Cech and Tudor 2005). From that observation, one might conclude that the sequence divergence between the Melissa blue and Karner blue noted above (comparable to other species pairs) is not surprising. However, whether species or subspecies, approaches based on mtDNA fail (due to shared haplotypes) to recognize the two taxa, which are divergent in wing pattern, male genital morphology, life history and host plant use (Gompert et al. 2006). Thus the question of formal taxonomic rank must be viewed separately from the question of whether or not sequence-based approaches are able to distinguish taxa that are recognizable on ecological, morphological, or behavioral grounds.

It could be argued by the proponents of DNA barcoding that it is inappropriate to use the barcoding approach below the species level because the rank of subspecies implies that taxa are not clearly differentiated. From this position it logically follows that discovery of cryptic variation is impossible. The problem is that cryptic species cannot exist or be discovered if taxonomy is assumed to be correct a priori. If a taxon is already identified as one species, then it cannot be two (one being cryptic). Similarly, it cannot be possible to discover that a subspecies warrants species status. This tautology is irrelevant when we (accurately) treat phylogenetic hypotheses as hypotheses. Taxonomic rank will not necessarily, therefore, predict the utility or accuracy of DNA sequence-based approaches for identification because population and evolutionary processes are not influenced by Latin binomials.

Analytical advances

Population genetic phenomena occurring at or near the species level, such as lineage sorting and hybridization, not only present problems for DNA barcoding approaches but also require their own analytical tools, such as coalescent, model-based approaches (discussed below). Methods of DNA sequence analysis employed at higher taxonomic levels (i.e. all of the statistical machinery used for phylogenetic reconstruction) may be inappropriate at lower taxonomic levels (Posada and Crandall 2001). There are several reasons for this:

1. Phylogenetic algorithms assume that evolutionary relationships arise by bifurcations of ancestral species to produce two new species. However, hybridization and hybrid speciation events (as well as recombination generally) can create reticulating patterns that violate this assumption (Posada and Crandall 2001) (Fig. 1D).

2. Taxa (and/or sequences) at the tips of phylogenetic trees represent extant taxa (and/or sequences), whereas all “ancestral” taxa (and/or sequences) are, by definition, extinct (Swofford et al. 1996; Felsenstein 2004; Freeland 2005). Variation in sequence data from within populations or species represents the genealogical relationships among the sequences and it is clearly possible, and indeed likely, that “ancestral” sequences still exist. Thus the convention in phylogenetics of representing all extant sequences at the tips of trees fails to illustrate the genealogical patterns of descent among sequences within populations and species (Fig. 2).

3. Because lineage sorting and mutation are inherently stochastic, phylogenetic relationships across multiple loci are unlikely to be congruent until well after divergence is complete (see Incomplete lineage sorting and the problem of recency above). Phylogenetic analyses are not designed to deal with the “noise” created by the discordance among gene trees (Degnan and Rosenberg 2006). At best, combined analyses of multiple loci in these situations will create phylogenetic trees with little or no support, which would highlight the recency of divergence. At worst, important information about population-level processes would be obscured or ignored.
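
The discordance described in point 3 can be illustrated with a standard three-taxon result from coalescent theory: if the internal branch of the species tree lasts T coalescent units, a single-locus gene tree conflicts with the species tree with probability (2/3)e^(-T). The Python sketch below only evaluates that formula. It assumes a neutral locus, one sequence sampled per taxon, constant population size, and no gene flow after divergence. The function names, the effective size of 10,000, and the divergence times are hypothetical values chosen for illustration, not estimates from any study reviewed here.

```python
import math


def coalescent_units(generations, effective_size, gene_copies_per_individual=2):
    """Convert a branch length in generations to coalescent units.

    Assumption: the time scale is gene_copies_per_individual * effective_size
    generations (2 for a diploid nuclear locus; 1 with a female effective
    size is a common approximation for mitochondrial DNA).
    """
    if generations < 0 or effective_size <= 0 or gene_copies_per_individual <= 0:
        raise ValueError("generations must be >= 0 and sizes must be > 0")
    return generations / (gene_copies_per_individual * effective_size)


def discordance_probability(internode_length):
    """Probability that a three-taxon gene tree conflicts with the species tree.

    internode_length is the internal branch length T in coalescent units.
    """
    if internode_length < 0:
        raise ValueError("internode_length must be >= 0")
    return (2.0 / 3.0) * math.exp(-internode_length)


# Hypothetical divergence times (in generations) for a diploid nuclear locus.
for generations in (1_000, 10_000, 100_000):
    t = coalescent_units(generations, effective_size=10_000)
    p = discordance_probability(t)
    print(f"{generations:>7} generations -> T = {t:.2f}, P(discordant) = {p:.3f}")
```

Under these assumptions the probability of conflict approaches two thirds as the internal branch shrinks toward zero and becomes negligible only once the branch spans several coalescent units, which is why recently diverged taxa are the ones most prone to conflicting gene trees.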

Instead of phylogenetic methods, alternative tools have been used by population geneticists over the last decade to handle the astounding increase in population-level molecular data and the associated problems (Wakeley 2007). In Fig. 2C we present a hypothetical case using just one example of an alternative method: a haplotype network, which is a graphical representation of the genealogical relationships among sequences. Networks allow for “ancestral” sequences within the data set and are free of the assumption that relationships among sequences must be bifurcating. The following references are just a few of the sources which contain comprehensive introductions to networks and other tools available to population geneticists today: Posada and Crandall (2001), Lowe et al. (2004), Freeland (2005), Excoffier and Heckel (2006).
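
To make the idea of a network concrete, the following Python sketch links a set of haplotypes by the fewest mutational steps using a minimum spanning tree built with Kruskal's algorithm. This is a deliberate simplification and not one of the published network methods: approaches such as statistical parsimony or median-joining can also infer unsampled haplotypes and display alternative connections as loops, which a spanning tree cannot. The five sequences, their labels, and the function names are hypothetical.

```python
from itertools import combinations


def hamming(seq_a, seq_b):
    """Number of sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))


def spanning_network(haplotypes):
    """Return minimum-spanning-tree edges as (name_a, name_b, steps) tuples."""
    names = list(haplotypes)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path compression.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    candidates = sorted(
        (hamming(haplotypes[a], haplotypes[b]), a, b)
        for a, b in combinations(names, 2)
    )
    edges = []
    for steps, a, b in candidates:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:  # keep the edge only if it joins two components
            parent[root_a] = root_b
            edges.append((a, b, steps))
    return edges


def degrees(edges):
    """Count how many connections each haplotype has in the network."""
    counts = {}
    for a, b, _ in edges:
        counts[a] = counts.get(a, 0) + 1
        counts[b] = counts.get(b, 0) + 1
    return counts


# Hypothetical aligned haplotypes; H1 differs by one site from H2, H3 and H4.
toy_haplotypes = {
    "H1": "ACGTAC",
    "H2": "ACGTAT",
    "H3": "ACGAAC",
    "H4": "TCGTAC",
    "H5": "ACGTTT",
}

network = spanning_network(toy_haplotypes)
for a, b, steps in network:
    print(f"{a} -- {b} ({steps} step)")
print(degrees(network))
```

In this toy data set H1 and H2 sit in the interior of the network, each connected to more than one other haplotype. A sampled sequence in that position is exactly the kind of extant “ancestral” sequence that a bifurcating tree would force onto a tip.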

The most significant conceptual advance (relevant to the issues which motivated the present paper) may be the recognition that certain patterns considered “noise” within traditional phylogenetic analyses actually represent valuable data which can be used to reconstruct and model evolutionary histories and processes. Coalescent theory (Hudson 1990; Kingman 2000) uses genealogical relationships among sequences (i.e. gene trees) to make inferences about demographic processes and histories of populations. For example, the sharing of mtDNA haplotypes among taxa or populations, which is simply inconvenient for analyses that assume bifurcating patterns of descent, can be used to estimate a class of population parameters, such as migration rate and ancestral population sizes, which get to the very heart of dynamic processes of diversification. Continuing advances in coalescent theory enable population geneticists to do more with sequence data than simply address taxonomic hypotheses (Hein et al. 2005).
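
As a minimal illustration of how the coalescent ties genealogies to population parameters, the Python sketch below simulates the time to the most recent common ancestor (TMRCA) of a sample under the standard neutral coalescent and compares the simulated mean with the analytical expectation 2(1 - 1/n). Times are in coalescent units, so converting them to generations requires an effective population size, which is the link that allows genealogies to inform demographic estimates. The sketch assumes a single panmictic population of constant size, with no migration or selection. The function names, sample size, replicate count, and random seed are illustrative choices and are not part of any analysis cited here.

```python
import random


def expected_tmrca(sample_size):
    """Analytical expectation of the TMRCA, in coalescent units."""
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    return 2.0 * (1.0 - 1.0 / sample_size)


def simulate_tmrca(sample_size, rng):
    """Draw one TMRCA by summing the waiting times between coalescent events.

    While k lineages remain, the waiting time to the next coalescence is
    exponentially distributed with rate k * (k - 1) / 2.
    """
    if sample_size < 2:
        raise ValueError("sample_size must be at least 2")
    total = 0.0
    for k in range(sample_size, 1, -1):
        total += rng.expovariate(k * (k - 1) / 2.0)
    return total


def mean_simulated_tmrca(sample_size, replicates, seed):
    """Average TMRCA over independent simulated genealogies."""
    rng = random.Random(seed)
    draws = (simulate_tmrca(sample_size, rng) for _ in range(replicates))
    return sum(draws) / replicates


# Hypothetical sample of 10 sequences, 20,000 simulated genealogies.
simulated = mean_simulated_tmrca(sample_size=10, replicates=20_000, seed=1)
print(f"simulated mean TMRCA: {simulated:.3f} coalescent units")
print(f"expected TMRCA:       {expected_tmrca(10):.3f} coalescent units")
```

Individual simulated genealogies vary widely around the expectation. That variance is what makes single-locus inference noisy, and it is one motivation for combining multiple independent loci.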

Conclusions

Our goal here has been to point out that there are very important places where molecular data can be perceived to be in conflict with taxonomy, and that this conflict is most likely to occur in situations where identification is the most challenging: at or near the species level, and among closely related taxa. In addition, significant discrepancies between molecular, morphological, and ecological characteristics in some cases, at least in the Lepidoptera, involve taxa of conservation concern. It is important to note the variety of ways that the taxa included in Table 1 differ within each study. The taxa are divergent in various characteristics, from wing patterns (in a majority of the studies) to genetically based differences in phenology (e.g. studies 11 and 23). Those divergent characteristics caused the researchers to suppose a priori that the taxa would be genetically distinct; that expectation was not borne out by the mtDNA markers analyzed.

The frequent conflict between single-locus genetic data and taxonomy ultimately teaches us that diagnosis of units for conservation should be based on a recognition of the multiple forces that drive the evolution of molecular, ecological, morphological and behavioral characters and the differential in rates of evolution for these characters (Rubinoff 2006). Non-molecular data often provide critical information on adaptive differences between recently diverged lineages that may not be detected with DNA data. The combination of data from multiple genetic markers with more traditional taxonomic data is the most effective approach to avoid taxonomic misdiagnosis. Discrepancies between the two types of data highlight an underappreciated facet of biodiversity: namely that a certain portion of the taxa we observe and may wish to conserve are of relatively recent origin. Consequently, all population-level processes should be considered when attempting to identify units for conservation, not solely the presumed neutral dynamics that underlie mitochondrial DNA evolution.