Introduction

Diatoms are unicellular photoautotrophic eukaryotes that are responsible for at least 25% of global carbon dioxide fixation (Falkowski et al. 1998; Field et al. 1998; Mann 1999; Smetacek 1999). They are an important part of benthic and planktonic biocoenoses and occur nearly ubiquitously in limnic, marine, and terrestrial ecosystems as well as in aerosols (Jahn et al. 2007). Therefore, diatoms are often used as bioindicators in water monitoring assessments and ecological studies (Stevenson and Pan 1999; Stoermer and Smol 1999). Even closely related taxa (excluding cryptic species) are often indicative of different ecological conditions (Poulíčková et al. 2008; Vanelslander et al. 2009). Hence, unambiguous identification of organisms down to species level is crucial for the quality of these studies. Archibald (1984) and Morales et al. (2001) have pointed out that many ecological and monitoring studies are misleading, because identifications have not been verified by experienced diatom taxonomists. Identifying diatoms morphologically beyond the genus level is difficult and requires expert knowledge, especially because frustule morphology can vary considerably even within a population (Babanazarova et al. 1996; Bailey-Watts 1976; Jahn 1986; Medlin et al. 1991).

In cases of groups with poor morphological resolution, Hebert et al. (2003) promoted the concept of a DNA barcode to help with the identification of taxa. A DNA barcode is an instrument for the correlation of a taxonomically undetermined individual to a taxon with similar genetic sequence in a given reference database (Ratnasingham and Hebert 2007). However, a suitable barcode marker has to meet three requirements. The ideal barcode marker (1) consists of a short sequence that can be easily amplified and sequenced in one read following a standardised laboratory protocol, (2) is flanked by a conserved region in which universal primers can be placed, and (3) has the power to resolve organisms at species level (e.g. Hebert et al. 2003; Moritz and Cicero 2004; Stoeckle 2003). Therefore, as in any environmental sampling approach, the quality of the method is not only related to the extent and quality of the reference database but also to the number of taxa that can be identified unambiguously (Erickson et al. 2008), and to the rate at which taxa are retrieved from environmental samples.

Applying the DNA barcoding concept to diatoms has great potential to resolve the problem of inaccurate species identification and thus facilitate analyses of the biodiversity of environmental samples. In particular, the use of DNA barcodes in diatoms can serve various purposes, such as (1) DNA-based species characterisation and (2) surveying the genetic diversity in an environment of interest. Each of these goals implies different requirements with respect to sequence characteristics. Whereas species characterisation needs sequences with high discriminatory power for defining and identifying even cryptic species, it is not necessarily dependent on fast and universal laboratory protocols. A survey of genetic diversity in environmental samples, however, often relies on high-throughput techniques and therefore needs universal primers and standard protocols where taxa do not have to be resolved on the finest scale (e.g. subspecies, cryptic species) (e.g. Hamsher et al. 2011).

Various gene regions have been proposed as barcode markers for diatoms. The mitochondrial cytochrome oxidase I gene (cox1) has been widely used for barcoding animals and other organism groups (e.g. Blaxter 2004; Blaxter et al. 2004; Hajibabaei et al. 2006a; Hebert et al. 2004; Robba et al. 2006; Saunders 2005; Seifert et al. 2007; Ward et al. 2005). Evans et al. (2007, 2008) successfully tested cox1 as a barcoding marker in 22 Sellaphora species and three other raphid genera of diatoms. Their study also included a test of the chloroplast ribulose-1,5-bisphosphate carboxylase oxygenase gene (rbcL), which was less variable than cox1 within the species sampling. However, in other organism groups such as red algae (e.g. Robba et al. 2006; Saunders 2005, 2008), brown algae (Kucera and Saunders 2008) and some green algae (e.g. Lewis and Flechtner 2004; McManus and Lewis 2005), the rbcL gene proved to be a promising barcode marker. Moniz and Kaczmarska (2009, 2010) proposed a combination of the nuclear 5.8S rRNA gene and ITS2 upon screening the most species-rich classes of diatoms including mainly marine taxa of the Mediophyceae and Bacillariophyceae. Furthermore, binary characteristics, such as presence/absence of compensatory base changes (CBCs) in the secondary structure of ITS2 or the presence/absence of certain indels, have been used to resolve species-level diversity in all kinds of organisms, including diatoms (Müller et al. 2007). This, however, includes the additional procedural step of calculating and analysing the secondary structure and, therefore, is too laborious for standard high-throughput analyses of environmental samples.

In existing sequence databases, the most extensive data record available for diatoms concerns the nuclear small ribosomal subunit (SSU-rRNA gene), as the latter has been used widely for phylogenetic and taxonomic purposes (e.g. Behnke et al. 2004; Beszteri et al. 2001; Friedl and O’Kelly 2002; Kooistra and Medlin 1996; Medlin et al. 1996; Medlin and Kaczmarska 2004; Sarno et al. 2005; Sorhannus 2007). This means that a substantial reference volume is already available (Hajibabaei et al. 2007), even though identification quality often is not verifiable and therefore does not meet DNA barcoding requirements. The 18S rRNA gene has been suggested as a potential barcoding marker for various organism groups, e.g. nematodes, tardigrades, and diatoms (Bhadury et al. 2006; Blaxter 2004; Blaxter et al. 2004; Floyd et al. 2002; Jahn et al. 2007; Powers 2004). The 18S region has been tested for diatoms in a pilot study by Jahn et al. (2007) and has been used as a marker in other protist groups (Scicluna et al. 2006; Utz and Eizirik 2007).

The present study proposes a 390–410 bp long fragment of the 1800 bp long 18S rRNA gene locus as a barcode marker for the analysis of environmental samples with high-throughput technologies such as 454 sequencing or microarrays, and discusses its use and limitations for diatom identification. The partial 18S region includes a section that is termed V4 in the nomenclature of Nelles et al. (1984) and represents the largest and most complex of the highly variable regions within the 18S locus (Nickrent and Sargent 1991).

Using newly designed universal primers for the V4 region that are introduced below, the region is identified as the most applicable one for barcoding on the 18S locus. Furthermore, an optimised standard laboratory protocol (including DNA extraction, PCR amplification and sequencing) is provided which was developed using diatoms from various limnic genera across many families to represent the freshwater diatom diversity. The study includes taxa from the three major divisions of diatoms: Coscinodiscophyceae (e.g. Aulacoseira spp.), Mediophyceae (e.g. Cyclotella spp., Stephanodiscus spp.) and Bacillariophyceae, with both raphid (e.g. Nitzschia spp.) and araphid representatives (e.g. Fragilaria spp.) (Table 1).

Table 1 List of all taxa, including strains, EMBL accession numbers (18S rRNA), voucher identification codes in the Herbarium Berolinense (B), and sampling localities

Methods

Taxon sampling

One hundred twenty three taxa from a wide range of genera throughout Bacillariophyta were used to test the universal applicability of different primer pairs of the 18S rRNA gene. The taxa sampled, the sample origins and/or corresponding EMBL numbers are listed in Tables 1 and 2. Vouchers of sequenced material are deposited in the Herbarium of the Botanic Garden and Botanical Museum Berlin-Dahlem (B), and described in more detail in AlgaTerra (Jahn and Kusber 2002+).

Table 2 List of the tested Sellaphora taxa with corresponding phenodemes, clone names (both after Evans et al. 2007, 2008) and EMBL accession numbers (18S rRNA)

To specifically test the power of the proposed barcode region to distinguish between closely related species, the genus Sellaphora (incl. Sellaphora pupula-group) was chosen as a test case (Table 2). This is a diatom genus with well-defined biological species concepts (Evans et al. 2007, 2008) as well as vouchered sequences.

Cultivation

DNA was isolated from non-axenic unialgal cultures derived from single cells isolated from environmental samples. The cultures were raised on a modified WC medium (Guillard and Lorenzen 1972) with salt concentrations of 28 g/l of CaCl2, 21 g/l of Na2SiO3 and 0.01 g/l of CuSO4. The cultures were stored in petri dishes sealed with Parafilm® M (American National Can Group; Chicago, IL) at 15–17°C and a 12 h day/night rhythm, or at room temperature and the ambient day/night cycle.

DNA isolation

The harvested cultures were transferred to 1.5 ml tubes. DNA was isolated using either Dynal® DynaBeads (Invitrogen Corporation; Carlsbad, CA, USA) or Qiagen® DNeasy Plant Mini Kit (Qiagen Inc.; Valencia, CA) following the respective product instructions. DNA concentrations were checked using gel electrophoresis (1.5% agarose gel) and Nanodrop® (PeqLab Biotechnology LLC; Erlangen, Germany). DNA samples were stored at −20°C until further use.

Secondary structure analysis

The secondary structure of the V4 region was analysed using Mfold (Zuker 2003) running under standard RNA settings (default), and compared to the secondary structure of a consensus sequence (Alverson et al. 2006) to identify possible primer regions within the 18S locus. Primers were designed manually. To assess the variability of the fragment within any given primer pairing, the consensus sequence of Alverson et al. (2006) was used.

Primer testing

All primers given in Table 3 were also tested for amplification and sequencing success at annealing temperatures of 50–54°C under the PCR regime mentioned below. Melting temperature, dimerisation between primer pairs and within single primers, as well as GC content were determined using SeqState under default settings (Müller 2005).
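
The primer properties named above can be approximated with a few lines of Python. This is a minimal sketch and not the SeqState procedure: the melting temperature uses the simple Wallace rule of thumb (2°C per A/T, 4°C per G/C), and the example primer is a made-up sequence used for illustration only.

```python
def gc_content(primer: str) -> float:
    """Return the fraction of G and C bases in a primer (0.0 to 1.0)."""
    seq = primer.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer: str) -> int:
    """Rule-of-thumb melting temperature in degrees C: 2*(A+T) + 4*(G+C).

    Assumption: this is the Wallace rule, a rough estimate for short
    oligonucleotides. It is not the calculation performed by SeqState.
    """
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


# Made-up 18-mer for illustration only.
example_primer = "GATTACAGATTACAGGCC"
example_gc = gc_content(example_primer)
example_tm = wallace_tm(example_primer)
```

Dimerisation checks are deliberately left out, because they depend on the scoring scheme of the tool that is used.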

Table 3 Tested primer sequences for 18S amplificates (M13 tails shown in italics; → =forward, ← =reverse)

PCR amplification

The V4 region of the 18S locus was amplified using different primer combinations (Table 3). The polymerase chain reaction (PCR) mix (25 μl) consisted of 14.65 μl HPLC H2O, 2.5 μl 10× buffer S, 1.5 μl MgCl2, 2.5 μl pecGOLD dNTPs, 0.5 μl BSA, 1 μl of each primer (20 pmol/μl), 0.35 μl pecGOLD Pur Taq® (all products by PeqLab Biotechnology), and 1 μl DNA sample. Two PCR regimes were used that differed only in annealing temperature. Each regime included an initial denaturation at 94°C (2 min), then five cycles consisting of denaturation at 94°C (45 s), annealing at 52°C (regime 1) or 54°C (regime 2) for 45 s, and elongation at 72°C (1 min), followed by 35 cycles in which the annealing temperature was lowered to 50°C (regime 1) or 52°C (regime 2), and a final elongation at 72°C (10 min). PCR products were visualised in a 1.5% agarose gel and cleaned with MSB Spin PCRapace® (Invitek LLC; Berlin, Germany) following standard procedure. DNA content was measured using Nanodrop (PeqLab Biotechnology).
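
As a quick arithmetic check of the reaction mix, the following sketch sums the listed volumes and scales them to a master mix. The volumes are those given in the text; the 10% pipetting overage and the decision to exclude only the DNA sample from the master mix are assumptions made for illustration.

```python
# Volumes per 25 µl reaction, in µl, as listed in the text.
REACTION_MIX_UL = {
    "HPLC H2O": 14.65,
    "10x buffer S": 2.5,
    "MgCl2": 1.5,
    "dNTPs": 2.5,
    "BSA": 0.5,
    "forward primer": 1.0,
    "reverse primer": 1.0,
    "Taq polymerase": 0.35,
    "DNA sample": 1.0,
}


def master_mix(n_reactions: int, overage: float = 0.1) -> dict:
    """Scale all components except the DNA sample for n reactions.

    Assumption: a 10% pipetting overage by default; the text does not
    state how master mixes were prepared.
    """
    factor = n_reactions * (1 + overage)
    return {
        name: round(volume * factor, 2)
        for name, volume in REACTION_MIX_UL.items()
        if name != "DNA sample"
    }


total_volume_ul = round(sum(REACTION_MIX_UL.values()), 2)  # volume of one reaction
total_cycles = 5 + 35  # five initial cycles plus 35 cycles at lowered annealing
```

The listed components add up to the stated 25 μl, and each regime runs for 40 cycles in total.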

A second PCR following the same protocol and primers (modified with 6 bp long 454 primer tails for sample identification) was run to produce samples for the 454 sequencing. After PCR, the products were also cleaned with MSB Spin PCRapace® (Invitek LLC) following standard procedure. The samples were normalised to a total DNA content >200 ng using Nanodrop (PeqLab Biotechnology).
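
Sample identification by 6 bp tails can be sketched as a simple lookup on the first six bases of each read. The tag sequences, sample names and reads below are hypothetical, and the sketch assumes exact tag matches at the start of the read; it does not reproduce the software used with the 454 instrument.

```python
def demultiplex(reads, tag_to_sample, tag_length=6):
    """Group reads by their leading tag and strip the tag from each read.

    Returns (reads_by_sample, unassigned_reads). Assumption: tags sit at
    the very start of the read and must match exactly.
    """
    by_sample = {sample: [] for sample in tag_to_sample.values()}
    unassigned = []
    for read in reads:
        tag = read[:tag_length].upper()
        insert = read[tag_length:]
        sample = tag_to_sample.get(tag)
        if sample is None:
            unassigned.append(read)
        else:
            by_sample[sample].append(insert)
    return by_sample, unassigned


# Hypothetical tags and reads for illustration only.
tags = {"ACGAGT": "sample_A", "TCTCTA": "sample_B"}
reads = ["ACGAGTGGCCTTAA", "TCTCTAGGCCTTAA", "NNNNNNGGCCTTAA"]
by_sample, unassigned = demultiplex(reads, tags)
```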

Sequencing

Sanger sequencing was used for the establishment of reference sequences, whereas 454 sequencing was conducted to establish intragenomic diversity. The Sanger sequencing was conducted by Starseq® (GENterprise LLC; Mainz, Germany). The M13 tails were used as sequencing primers (Table 3), following Ivanova et al. (2007). M13 tails consist of 17–18 bases that are attached at the 5′ end of the regular PCR primer during oligo synthesis. The M13 sequences become amplified at both ends of the PCR product and subsequently can be used as sequencing primers. This prevents loss of sequence information compared to the use of normal internal sequencing primers. As M13 tails can be attached to any primer, only one pair of sequencing primers is necessary regardless of the PCR primers used.
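
The construction of a tailed primer can be illustrated as a string concatenation. The two M13 sequences below are the widely used M13 forward (18 bases) and reverse (17 bases) sequences; whether they are identical to the tails listed in Table 3 is an assumption, and the PCR primer is a made-up sequence.

```python
# Widely used M13 sequences, written 5'->3'.
# Assumption: the tails listed in Table 3 may differ from these.
M13_FORWARD = "TGTAAAACGACGGCCAGT"  # 18 bases
M13_REVERSE = "CAGGAAACAGCTATGAC"   # 17 bases


def add_tail(tail: str, primer: str) -> str:
    """Attach a sequencing tail to the 5' end of a PCR primer.

    Both sequences are written 5'->3', so the tail is simply prepended.
    """
    return tail + primer


# Made-up PCR primer for illustration only.
pcr_primer = "GATTACAGATTACAGGCC"
tailed_primer = add_tail(M13_FORWARD, pcr_primer)
```

Because the tail is independent of the locus-specific part, the same function applies to any PCR primer.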

The sequences were edited in ChromasPro (Technelysium Pty. Ltd.; Tewantin, Australia), aligned using ClustalW (Larkin et al. 2007), and manually improved in BioEdit (Hall 1999).

Sequences for intragenomic comparisons were generated with a 454 sequencer (454 Life Sciences; Roche Company; Branford, CT) using GS FLX Titanium® chemistry, following the manufacturer’s instructions. All sequences were compared against the reference sequence database created via Sanger sequencing. Only sequences with a complete primer sequence and longer than 250 bp were included.
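
The inclusion criterion for 454 reads can be written as a short filter. This is a sketch under stated assumptions: the primer is expected at the start of the read with an exact match, the read length is counted including the primer, and the primer sequence is made up.

```python
def passes_filter(read: str, primer: str, min_length: int = 250) -> bool:
    """Keep a read only if it starts with the complete primer and is
    longer than min_length bases.

    Assumptions: exact primer match at the read start, and read length
    measured including the primer.
    """
    read = read.upper()
    return read.startswith(primer.upper()) and len(read) > min_length


# Made-up primer and reads for illustration only.
primer = "GATTACAGATTACAGGCC"
kept = passes_filter(primer + "A" * 240, primer)             # 258 bases, full primer
too_short = passes_filter(primer + "A" * 100, primer)        # 118 bases
truncated = passes_filter(primer[3:] + "A" * 300, primer)    # primer incomplete
```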

Statistics

For analysis of the intraspecific and intrageneric variation, sequences from Sanger sequencing (35 sequences; Table 1, EMBL accession numbers FR873231 to FR873265) were used and complemented with sequences downloaded from EMBL (164 sequences; Table 1, all remaining EMBL accession numbers).

Uncorrected p-distances were computed using both DOINK (J. Ehrman, Digital Microscopy Facility, Mount Allison University, Sackville, NB, Canada) and PAUP 4.0b10 (Swofford 2002), as the former program cannot interpret ambiguity coding, whereas the latter does not distinguish between gaps and missing data. The significance of the divergence between intraspecific and intrageneric genetic distances was tested with the Wilcoxon rank-sum test using R (R Development Core Team 2005).
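
An uncorrected p-distance is the proportion of compared alignment positions at which two sequences differ. The following sketch shows one way to compute it and makes the treatment of ambiguity codes and gaps explicit. It is not a reimplementation of DOINK or PAUP; the choices to skip ambiguous positions and to make gap handling optional are assumptions for illustration.

```python
UNAMBIGUOUS = set("ACGT")


def p_distance(seq_a: str, seq_b: str, gap_is_difference: bool = False) -> float:
    """Uncorrected p-distance between two aligned sequences.

    Positions with ambiguity codes or missing data (anything other than
    A, C, G, T or '-') are skipped. A gap opposite a base is either
    skipped or counted as a difference, depending on gap_is_difference.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        a_gap, b_gap = a == "-", b == "-"
        if a_gap and b_gap:
            continue  # shared gap carries no information
        if a_gap or b_gap:
            if gap_is_difference and (a in UNAMBIGUOUS or b in UNAMBIGUOUS):
                compared += 1
                differences += 1
            continue
        if a not in UNAMBIGUOUS or b not in UNAMBIGUOUS:
            continue  # ambiguity code or missing data
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable positions")
    return differences / compared
```

For two aligned 10 bp sequences that differ at one position, the function returns 0.1; the same pair with one ambiguous position excluded is evaluated over nine positions only.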

Results

DNA isolation

Non-destructive DNA isolation with the Dynal® DynaBeads generally yielded more DNA (up to 50%; details available from the authors upon request) than isolation with the Qiagen® DNeasy Plant Mini Kit, for which the diatom frustules were crushed before the extraction procedure was started.

PCR protocol

First, the entire 18S rRNA gene was screened for genetic variability between several diatom taxa for barcoding purposes. Then different fragments of high variability, short enough to be sequenced in one read (454 and Sanger), were tested for universal primer binding sites, PCR amplification and sequencing success. A summary of amplification and sequencing success, fragment lengths and variable positions within a fragment is given in Table 4. Among the tested primer pairs, D512for 18S and D978rev 18S as well as their M13 derivatives were successful in 100% of the tested taxa in both amplification and sequencing, with the PCR regime described under PCR amplification yielding the most PCR products. All other primer pairings were less suitable as barcoding primers due to poorer amplification and sequencing success and/or to a worse fragment length/variability ratio (Table 4, Fig. 1). Furthermore, the fragment enclosed by the D512for/D978rev primer pair is short enough to be sequenced in one read and has at least 60 putatively variable base pair (bp) positions. The automated primer design software SeqState (Müller 2005) also favoured the application of this primer pair.

Table 4 Percentage of successful amplifications (annealing temperature regime 1 (52–50°C) / regime 2 (54–52°C)) and percentage of successful sequences of the amplificates from PCR regimes 1 and 2 in all 35 taxa; fragment length and number of variable positions on the given fragment following Alverson et al. (2006) for each primer pair

Fig. 1 Consensus secondary structure of the 18S locus (SSU rRNA gene) in diatoms (181 sequences), based on the Toxarium undulatum 18S secondary structure model as reference sequence. Upper-case letters indicate that nucleotides at corresponding positions are conserved in 98–100% of sequences, lower-case letters indicate 90–98% conservation, dots 80–90% conservation, circles indicate greater than 20% variability. V4 region and primer binding sites (see Table 3) shown highlighted and in brackets. Where primers overlap their names and brackets are numbered accordingly. Tags at V4 region indicate indels relative to Toxarium undulatum sequence; tag format is (maximum length of indel: percentage of sequences showing length polymorphisms). Figure modified after Alverson et al. (2006) and Gillespie et al. (2006)

DNA sequencing

Sanger sequencing produced sequences of 35 taxa from unialgal cultures (Table 1, EMBL accession numbers FR873231 to FR873265).

The number of sequences generated by 454 sequencing for calculating the intragenomic variation varied between 16 and 112 per taxon (total 1010; Table 5). All sequences >250 bp from the 454 run could be assigned unambiguously to one of the reference sequences from the Sanger sequencing.
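
The assignment of a read to a reference sequence can be sketched as a nearest-reference search that also reports whether the best match is unique. The sketch assumes that reads and references are already aligned to the same length, which real pipelines achieve with an alignment step; taxon names and sequences are hypothetical.

```python
def mismatch_fraction(a: str, b: str) -> float:
    """Proportion of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def assign(read: str, references: dict):
    """Return (best_name, best_distance, is_unambiguous).

    The assignment is unambiguous when no other reference is as close to
    the read as the best one.
    """
    scored = sorted(
        (mismatch_fraction(read, seq), name) for name, seq in references.items()
    )
    best_distance, best_name = scored[0]
    is_unambiguous = len(scored) == 1 or scored[1][0] > best_distance
    return best_name, best_distance, is_unambiguous


# Hypothetical reference sequences and read for illustration only.
references = {
    "taxon_1": "ACGTACGTAC",
    "taxon_2": "ACGTTTGTAC",
    "taxon_3": "TTGTACGTAA",
}
result = assign("ACGTACGTAA", references)
```

A read that is equally distant from two references is flagged as ambiguous rather than being assigned silently.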

Table 5 Uncorrected p-distances given as average, minimum and maximum values; n = number of sequences (per strain) or number of individuals (per species/genus), respectively

Genetic distances and statistics

To analyse genetic distances between and within strains (several sequences analysed for one unialgal culture), species and genera for the proposed 18S rRNA gene fragment (V4), uncorrected p-distances were calculated. The average, minimum and maximum p-distance values are given in Table 5. Average genetic distance within one strain varied between p = 0.000 (Nitzschia acicularis, N. linearis) and p = 0.005 (Hantzschia amphioxys). Intraspecific variation also ranged between p = 0.000 (e.g. Achnanthidium minutissimum) and p = 0.005 (Nitzschia pusilla, Pinnularia mesolepta, Stauroneis kriegeri). Intrageneric distance varied between p = 0.011 (Mayamaea spp.) and p = 0.174 (Melosira spp.), except for Stephanodiscus spp., in which the average intrageneric variation was only p = 0.001 (Table 5). Except for Stephanodiscus, intrageneric (heterospecific) variation was always higher than both intraspecific variation and the variation within each strain (for example, intraspecific variation in Aulacoseira varied between p = 0.000 and p = 0.001 while intrageneric distance was p = 0.048; Table 5). The Wilcoxon rank-sum test showed that the genetic distances within the species of the 16 tested genera (Table 5) are significantly lower than those between the species in these genera (\( p = 2.2 \times 10^{-16} \); Fig. 2).
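
The logic of the rank-sum comparison can be illustrated with a small stand-alone function. This is a sketch, not the R routine used in the study: it uses the normal approximation with tie correction and no continuity correction, so p-values will differ somewhat from R output, and the input distances are hypothetical.

```python
import math


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum (Mann-Whitney U) test.

    Returns (U statistic for x, approximate p-value). Uses average ranks
    for ties and a tie-corrected normal approximation without continuity
    correction.
    """
    n1, n2 = len(x), len(y)
    if n1 == 0 or n2 == 0:
        raise ValueError("both samples must be non-empty")
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = n1 + n2
    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of ranks i+1 .. j
        group_size = j - i
        tie_term += group_size**3 - group_size
        rank_sum_x += average_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0  # all values identical
    z = (u - mean_u) / math.sqrt(var_u)
    return u, math.erfc(abs(z) / math.sqrt(2))


# Hypothetical p-distances for illustration only.
intraspecific = [0.000, 0.001, 0.002, 0.003, 0.005]
intrageneric = [0.011, 0.048, 0.060, 0.090, 0.174]
u_stat, p_value = rank_sum_test(intraspecific, intrageneric)
```

With complete separation of the two groups, as in the hypothetical input, the U statistic of the lower group is zero.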

Fig. 2 Box-and-whisker plot of intraspecific and intrageneric (x-axis) genetic distances measured in uncorrected p-distances (y-axis). Thick black lines indicate median values, boxes represent upper and lower quartiles, whiskers indicate value ranges, circles represent outliers

Genetic distance among taxa in Sellaphora ranged between p = 0.003 (Sellaphora blackfordensis/Sellaphora pupula phenodeme southern pseudocapitate) and p = 0.087 (Sellaphora cf. minima/Sellaphora pupula phenodeme europa), with an average p = 0.039 (Table 6). The average intraspecific genetic distance within Sellaphora laevissima is p = 0.005 (min. p = 0.000, max. p = 0.007; number of sequences = 3; Table 6); within Sellaphora pupula phenodeme elliptical it is p = 0.000 (number of sequences = 2; Table 6).

Table 6 Uncorrected p-distances among tested taxa (clones and/or phenodemes) in the genus Sellaphora. Values in boxes labeled a–e are discussed in detail

Discussion

The analysis of environmental samples via DNA barcoding needs to facilitate the detection of—in this case diatom—diversity as well as the identification of species present in the respective sample. For the first part a standard laboratory protocol (including universal primers) is essential, for the second a critical assessment of intra- versus interspecific variation is needed.

Standard laboratory protocol

The development of a standard laboratory protocol considered DNA extraction as well as fragment amplification and sequencing including primer design. The DNA extraction using Dynal® DynaBeads is a non-destructive process that leaves the frustules intact and available for microscopic examination and taxonomic determination, e.g. if species have not yet been deposited in a reference database and morphological vouchers have to be cross-checked after sequencing or if mixed samples have to be analysed microscopically and valves have to be counted for quantification. Even if the Qiagen® DNeasy Plant Mini Kit is used non-destructively, it includes more centrifuging steps that could damage the larger diatom frustules in particular, or fragile frustule characteristics that can be crucial for identification.

Concerning the Dynal® DynaBeads method, it has to be noted that after the extraction the residue containing the frustules has to be centrifuged, the supernatant removed, and replaced by pH-neutral storage buffer. Otherwise the frustules might dissolve. The DNA yield is higher than with the Qiagen® DNeasy Plant Mini Kit. Because of the better performance and the conservation of the frustules, the non-destructive DNA isolation was chosen.

Of the six different primer pairs that were tested, D512for 18S and D978rev 18S, as well as their M13 variants, were the most successful with respect to amplification and sequencing success, and exhibited the best fragment length/variability ratio (Table 4, Fig. 1). PCR amplification with primers D512for 18S and D978rev 18S was successful in all taxa in our study and in many other taxa (e.g. Skeletonema spp., Phaeodactylum spp., Surirella spp., Campylodiscus spp.; authors’ unpublished data). This high amplification efficiency is due to the placement of the primers in highly conserved stem-loop sections of the 18S rRNA gene (Fig. 1) that exhibit low mutation rates and are conserved across a wide range of diatom taxa, and therefore make ideal binding sites for universal primers. The M13 tails were used as universal sequencing primers (Ivanova et al. 2007), which contributed to the high sequencing success.

Importantly, the primer combination D512for 18S and D978rev 18S includes the highly variable V4 region of the 18S rRNA gene (Fig. 1), which encloses many indel regions that contribute to the increased information level on this short fragment (Alverson et al. 2006). The other tested primer pairs also result in short variable segments, but with lower universality concerning the laboratory success. These fragments are also less variable and thus do not allow species-level identification within diatoms (Fig. 1).

Besides the primer universality, the V4 region has another promising feature for barcoding environmental samples: The association of the sequences produced by 454 sequencing to the reference data generated via Sanger sequencing was always unambiguously possible—due to the systematic selection procedure—without much computing and editing effort after sequencing. In addition, no problems emerged in the present study concerning homopolymer errors in the sequences as are often encountered when applying pyrosequencing (Huse et al. 2007).

For high-throughput studies it is also important that the barcode does not exceed a certain length, currently around 400 bp. This length keeps increasing along with the development of sequencing techniques and computation capacity (Schloss 2010), but the cost of sequencing increases accordingly. This is one reason why Hajibabaei et al. (2006b) proposed a 100 bp barcode, which would also work with high-throughput technologies that only produce shorter read lengths, such as Illumina. The V4 region (Fig. 1) in itself is only about 60 bp long, so that it could qualify as such a short barcode without losing its resolving power. Some studies already use very short sequences to evaluate prokaryotic diversity in environmental samples (Huber et al. 2009; Huse et al. 2007; Schloss 2010).

For these reasons (a standard laboratory protocol, primer universality, and informative indels on a short fragment), the V4 region, or possibly only a 60 bp part of it, shows high potential for use in fast, high-throughput approaches to environmental barcoding using next-generation sequencing.

Species identification

For the assessment of the 18S fragment’s power to resolve taxa at species level, uncorrected p-distances were used. All species tested in this study feature uniform sequences allowing unambiguous resolution at species level, with the only exception concerning Stephanodiscus. This genus is well known as problematic in morphological discrimination due to the small size of the individuals and to valve plasticity, which often overlaps between species (Håkansson and Kling 1989, 1990; Kobayasi et al. 1985; Spamer and Theriot 1997; Teubner 1997; Wolf et al. 2002). Molecular species identification in Stephanodiscus is also difficult (Moniz and Kaczmarska 2009, 2010), possibly because some taxa have diverged only very recently, e.g. S. niagarae and S. yellowstonensis about 12,000 to 8,000 years ago (Zechman et al. 1994).

Intraspecific variation was very low in general, not exceeding p = 0.005 (Hantzschia amphioxys, Table 5). Intrageneric variation was significantly higher than intraspecific variation (Table 5), with Stephanodiscus as the only exception. This leads to the assumption that, even though the p-distances are low compared to other markers (e.g. Huang et al. 2007; Wu et al. 2008; Xia et al. 2003), the 18S fragment (V4) used in the present study still has informative value as a barcoding marker to resolve taxa at the species level.

So far, the resolving ability of a given barcode marker has been assessed using either a fixed threshold or the concept of the “barcode gap” (Hollingsworth et al. 2009), meaning a well-defined difference between the levels of intra- and interspecific variation, often calculated by means of a ratio. Initially some studies used a 10-fold increase to gauge the applicability of a certain marker (Hebert et al. 2003). More recently, however, it has been shown that taxa differ considerably in their genetic variation, so that different studies now use very different ratios and thresholds depending on the respective organism group and marker (e.g. Cywinska et al. 2006; Hajibabaei et al. 2006a, b; Hebert et al. 2004; Hickerson et al. 2006; Meyer and Paulay 2005; Ward et al. 2005). For the cox1 gene a threshold of p = 0.04 is considered sufficient in red algae (Saunders 2005), for the ciliate genus Tetrahymena p = 0.11 (Chantangsi et al. 2007), and for Paramecium p = 0.20 (Barth et al. 2006). Moniz and Kaczmarska (2009) give a minimum intrageneric distance of p = 0.07 for a combination of the 5.8S rRNA gene and ITS2 within diatoms.
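
A ratio-based barcode gap check can be sketched as follows. The definition used here (average intrageneric distance divided by the maximum intraspecific distance, compared against a 10-fold threshold) is one possible choice among the several used in the literature and is an assumption of this sketch. The Aulacoseira values are those reported in the Results; the second call uses purely illustrative input.

```python
def fold_difference(max_intraspecific: float, mean_intrageneric: float) -> float:
    """Ratio of intrageneric to intraspecific distance.

    Returns infinity when no intraspecific variation was observed.
    """
    if max_intraspecific == 0:
        return float("inf")
    return mean_intrageneric / max_intraspecific


def has_barcode_gap(max_intraspecific: float, mean_intrageneric: float,
                    min_fold: float = 10.0) -> bool:
    """True if the intrageneric distance is at least min_fold times the
    intraspecific distance (assumed 10-fold rule)."""
    return fold_difference(max_intraspecific, mean_intrageneric) >= min_fold


# Aulacoseira: intraspecific maximum p = 0.001, intrageneric p = 0.048.
aulacoseira_fold = fold_difference(0.001, 0.048)
aulacoseira_gap = has_barcode_gap(0.001, 0.048)

# Illustrative input only: intrageneric distance equal to the
# intraspecific maximum, so no gap is detected.
no_gap = has_barcode_gap(0.001, 0.001)
```

For Aulacoseira the ratio is about 48, well above a 10-fold threshold.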

The variation in the 18S rRNA gene has been considered as too low for a barcoding marker in diatoms (Moniz and Kaczmarska 2009, 2010). This, however, refers to the complete 18S locus, which is much longer (1800 bp) than the one used in the present study (ca. 390–410 bp). As most of the 1800 bp fragment comprises extremely conserved regions, the genetic distance between species is reduced if the complete 18S rRNA gene locus is used. In the present study the region responsible for species identification is mainly the only ca. 60 bp long V4 region (Fig. 1). As mentioned above, the V4 region comprises not only many variable character sites but also many inversions, insertions and deletions, resulting in a highly concentrated information content on a very short fragment (Alverson et al. 2006).
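
The dilution effect described above follows directly from the definition of the p-distance. The sketch below is a worked example under an idealised assumption: all differences between two species lie inside the barcode fragment, and the rest of the gene is identical. The number of differences (12) is hypothetical, and 400 bp is used as a representative fragment length.

```python
def p_distance_from_counts(differences: int, aligned_length: int) -> float:
    """Uncorrected p-distance from a difference count and alignment length."""
    return differences / aligned_length


# Hypothetical: 12 differing positions, all located inside the fragment.
differences = 12
fragment_distance = p_distance_from_counts(differences, 400)    # barcode fragment
full_gene_distance = p_distance_from_counts(differences, 1800)  # complete 18S locus
dilution_factor = full_gene_distance / fragment_distance        # equals 400 / 1800
```

Under this assumption the same 12 differences give p = 0.03 on the fragment but only about p = 0.0067 on the complete locus.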

The V4 region appears to allow discrimination between species to a degree sufficient for environmental DNA barcoding. Therefore, to further test the power of this region for species identification in a closely related taxon complex, an exclusively in silico analysis within the Sellaphora pupula-group and sister taxa was performed. Sellaphora is a genus with well-established species concepts and extensive data on mating behaviour, morphology, ecology and DNA sequence variation (Evans et al. 2007, 2008). The Sellaphora pupula-group consists of very closely related species and thus provides a strong test of the reliability of the proposed barcode region. The V4 region was able to discriminate between all the included taxa (following Evans et al. 2008).

There are some taxon pairs with very low genetic distances (Table 6, b–d), one of them comprising Sellaphora blackfordensis and S. pupula clone AUS4 phenodeme southern pseudocapitate (Table 6, b), the second S. blackfordensis and S. pupula clone AUS1 phenodeme southern capitate (Table 6, c). These three taxa also form a well-supported clade in the rbcL-based phylogenetic tree provided by Evans et al. (2008). The third such pair contains Sellaphora lanceolata and S. bacillum (Table 6, d), showing a relationship which is consistent with the findings of Evans et al. (2008) as well. That the genomic variation between these pairs is lower than or similar to the variation within Sellaphora laevissima could indicate, for instance, that the V4 region is not powerful enough to distinguish between all cryptic species, or that the species circumscriptions do not necessarily reflect the genetic diversity.

Within the former Sellaphora pupula taxon there are two identical sequences (Table 6, e), both designated as S. pupula phenodeme elliptical by Evans et al. (2008). Whether the genetic distances between these phenodemes represent population differences or variation between cryptic species needs further consideration (e.g. Evans et al. 2008). This shows that the V4 region also may have some potential for identifying closely related species, even though it might not be enough for defining them.

The V4 region of the 18S locus as a barcode marker

Various other barcodes have been proposed for various groups of organisms, among them the plastid regions rbcL, matK, trnH-psbA, the 23S rRNA gene, the mitochondrial gene cox1, and the nuclear markers ITS, entire 18S (SSU) rRNA gene and 28S (LSU) rRNA gene (e.g. Bhadury et al. 2006; Fazekas et al. 2008; Hebert et al. 2004; Hollingsworth et al. 2009; Kress and Erickson 2007; Kress et al. 2005; Newmaster et al. 2008; Summerbell et al. 2005). However, cox1, ITS, 18S and rbcL are the only ones which have been applied to diatoms, with mixed results, i.e. cox1 was very variable but no universal primers could be found, ITS was variable but is not universally amplifiable with standard laboratory protocols, rbcL was less variable, and 18S (whole gene) was not variable enough (e.g. Evans et al. 2007, 2008; Jahn et al. 2007; Moniz and Kaczmarska 2009, 2010).

That the cox1 gene is variable enough to discriminate between very similar taxa (e.g. cryptic species) has been stated for many groups throughout the tree of life (Barth et al. 2006; Chantangsi et al. 2007; Hebert et al. 2003; Kucera and Saunders 2008; Lynn and Strüder-Kypke 2006; Saunders 2005). However, a preliminary study using a dataset of over 60 diatom species from various groups to design universal primers for the cox1 gene (unpublished data) showed that it is virtually impossible to do so, because the locus lacks sufficiently conserved regions for primer binding. Universal primers constitute an essential condition for environmental analyses. Various publications have shown that this problem occurs not only within diatoms (e.g. Evans et al. 2007, 2008; Moniz and Kaczmarska 2009) but also in many other eukaryotic organism groups, e.g. in land plants (Cowan et al. 2006), dinoflagellates (Ferrell and Beaton 2007), gastropods (Kane et al. 2008), and fungi (Seifert et al. 2007). Most studies on the use of the cox1 gene as a barcoding marker for protists are limited to very confined groups, e.g. genera, and use group-specific primers (Chantangsi et al. 2007; Evans et al. 2007, 2008). In diatoms this high variability of the cox1 locus could be due to the occurrence of intron events and introgression of bacterial genes, both common in diatoms (Armbrust et al. 2004; Bowler et al. 2008; Ehara et al. 2000; Imanian et al. 2007; Ravin et al. 2010).

The combination of the 5.8S rRNA gene and ITS2 has been suggested as an alternative barcoding locus (Moniz and Kaczmarska 2009, 2010). Its potential to identify species is promising and has been demonstrated in many protists, fungi and plant groups (e.g. Gemeinholzer et al. 2006; Kelly et al. 2010; Litaker et al. 2007; Taylor et al. 2008). There are, however, some problems, the main one being that ITS is not easy to amplify and sequence with standard laboratory protocols (unpublished data; see also Hamsher et al. 2011). Furthermore, studies in fungi using ITS suggested that errors in amplification/sequencing—especially in high throughput—could easily lead to overestimation of diversity in environmental samples (Bellemain et al. 2010).

Plastid markers such as the rbcL gene could be problematic for DNA barcoding, as the plastid inheritance in diatoms is not uniform but can be either uniparental or biparental (Casteleyn et al. 2009; Jensen et al. 2003; Levialdi Ghiron et al. 2008; Round et al. 1990), and there are rare reports of natural hybrids (Casteleyn et al. 2009).

The 18S rRNA gene locus is often used to estimate the relative abundances and diversities of species in environmental samples (Liao et al. 2007), due to its low intraspecific but high interspecific variation. It has also been used to define operational taxonomic units (OTUs) in various eukaryotes (Ciliophora, Dinophyceae, Cercozoa, and Fungi; Lefèvre et al. 2007). The analysis of water samples via a 550 bp long fragment of the 18S rRNA gene locus was able to resolve metazoans (e.g. nematodes), the algal groups Prasinophyceae, Cryptophyceae, Dinophyceae, and Prymnesiophyceae, as well as heterotrophic Cercozoa, Choanoflagellates, Stramenopiles, and Ciliates (Romari and Vaulot 2004). It has been shown that the 18S rRNA gene can also discriminate diatoms in most cases of environmental samples, often to the species level (Jahn et al. 2007; Savin et al. 2004).
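How low intraspecific and high interspecific variation translate into OTUs can be shown with a minimal sketch. It is an illustration under stated assumptions, not the method of the studies cited above: the sequences are invented toy data rather than real 18S fragments, the distance threshold of 0.1 is arbitrary, and `p_distance` and `greedy_otus` are hypothetical helper names. Sequences are compared by uncorrected p-distance, and each sequence joins the first OTU whose seed sequence lies within the threshold.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences, skipping gap columns."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    compared = 0
    differing = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue  # ignore positions where either sequence has a gap
        compared += 1
        if x != y:
            differing += 1
    return differing / compared if compared else 0.0


def greedy_otus(seqs: dict[str, str], threshold: float) -> list[list[str]]:
    """Greedy clustering: join the first OTU whose seed is within the threshold, else open a new OTU."""
    otus: list[list[str]] = []
    for name, seq in seqs.items():
        for otu in otus:
            seed = seqs[otu[0]]  # the first member of an OTU serves as its seed
            if p_distance(seq, seed) <= threshold:
                otu.append(name)
                break
        else:
            otus.append([name])
    return otus


# Invented toy alignment (assumption: NOT real 18S V4 data).
toy = {
    "isolate_A1": "ACGTACGTACGTACGTACGT",
    "isolate_A2": "ACGTACGTACGTACGTACGA",  # 1 of 20 sites differs from A1
    "isolate_B1": "ACGTTCGTACCTACGTAGGA",  # 4 of 20 sites differ from A1
}

otus = greedy_otus(toy, threshold=0.1)
# otus == [["isolate_A1", "isolate_A2"], ["isolate_B1"]]
```

The result of such greedy clustering depends on the input order and on the chosen threshold, so it should be read only as a sketch of the principle.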

The main advantage of the V4 fragment of the 18S locus is that it is very easy to amplify with the proposed universal primers using our documented standard laboratory protocol, while it still has considerable power to resolve taxa at the species level. Both of these characteristics are crucial for its successful use in environmental studies. The potential of the V4 fragment to discriminate between (semi-)cryptic species has to be further evaluated. However, while this aspect is desirable, it is not necessary for its use in environmental studies, as the members of cryptic-species complexes generally seem to have similar ecology (Beszteri et al. 2005a,b, 2007).

A further advantage of the 18S locus is its high representation in databases. A good retrieval rate for correct identifications strongly depends on the reference data. But even though the reference database for the 18S rRNA gene is more extensive than for many other proposed barcode regions, it nevertheless has to be extended, especially with voucher-based sequences.

Conclusions

The crucial problem in selecting an applicable barcode is the balance between variability and primer-binding universality. For the analysis of environmental samples primer universality and reproducible laboratory protocols are of high importance, whereas for the detection and delimitation of cryptic species these aspects are often secondary.

For the detection of cryptic species, other, more variable barcodes might be more suitable. But as discussed in many other studies, some problems, such as species delimitation and α-taxonomy, presumably cannot be solved with only one barcode (e.g. Chase et al. 2007; Cowan et al. 2006; Kress and Erickson 2007). A single barcode represents only a fraction of an organism’s variation; therefore its power to define a taxon should not be overestimated. Consequently, a combination of the V4 region with other barcodes such as ITS should be discussed.

The 18S rRNA gene fragment proposed in the present study shows enough variation to unambiguously identify almost all tested taxa. Furthermore, the highly conserved primer binding sites allow amplification following a standard procedure. Due to its relatively short length it is also feasible for time- and cost-saving high-throughput analysis methods. The V4 region of the 18S locus therefore is a good candidate for barcoding diatoms in environmental samples.