Introduction

The internal transcribed spacer (ITS) of the nuclear ribosomal DNA cistron is the barcoding locus for Fungi (Schoch et al. 2012), because of its versatility in resolving the species-level clades and the large amount of ITS sequence data already available (Pinto et al. 2004; Rossman 2007; Seifert 2008; Taylor and McCormick 2008; Eberhardt 2010; Begerow et al. 2010; Del-Prado et al. 2010; Hibbett et al. 2011; Schoch et al. 2012). Nuclear ribosomal DNA occurs in multiple copies in the genome and is easy to amplify from small quantities of template material. This is also disadvantageous, since multiple copies open up the possibility of paralogs and pseudogenes that do not reflect species phylogenies (Simon and Weiss 2008; Nilsson et al. 2008; Kovács et al. 2011a, b; Kiss 2012). It is assumed that the evolution of paralogs and pseudogenes in nuclear ribosomal DNA is prevented by concerted evolution (Hurst and Smith 1998; Liao 1999, 2008; Ganley and Kobayashi 2007). Yet, an increasing number of cases are being reported that suggest ITS paralogs in Fungi (Hughes and Petersen 2001; Ko and Jung 2002; Gomes et al. 2002; Smith et al. 2007; Simon and Weiss 2008; Nilsson et al. 2008; James et al. 2009; Lindner and Banik 2011; Kovács et al. 2011a, b; Li et al. 2013; Lindner et al. 2013). In plants, such cases were mostly interpreted as the result of hybridization (Hughes and Petersen 2001; Won and Renner 2005; Gulyas et al. 2005). On the other hand, it is unclear to what extent intragenomic variation detected through cloning, restriction fragment length polymorphism, and next-generation sequencing (NGS) (Gomes et al. 2002; Smith et al. 2007; Lindner and Banik 2011; Lindner et al. 2013) is real or artifactual. Critical screening of ITS Sanger sequences for selected taxa reveals that sequencing errors are frequent (Lücking et al. 2012; Moncada et al. 2013). On the other hand, approaches using specific primers for ITS pseudogenes (Li et al. 2013) leave little doubt that such pseudogenes are a reality at least in some fungal lineages.

Next-generation sequencing offers a powerful approach to the analysis of potential intragenomic variation in multicopy genes. It produces a large amount of sequence data in a short time at low cost and allows for discrimination of biological entities within a single source without cloning (Margulies et al. 2005; Ahmadian et al. 2006; Rothberg and Leamon 2008; Nilsson et al. 2011). Apart from genome sequencing, these approaches are useful for analyzing microbial and fungal communities at the environmental level (O’Brien et al. 2005; Sogin et al. 2006; Geml et al. 2008; Taylor et al. 2008; Buée et al. 2009; Amend et al. 2010; Lumini et al. 2010; Wallander et al. 2010; Hibbett et al. 2011; Unterseher et al. 2011; Dai et al. 2012; McGuire et al. 2012). On the downside, NGS presents a number of problems, the most important being the huge amount of data generated, which requires approaches to data analysis that rely mostly on automated clustering (Li and Godzik 2006; Schloss et al. 2009; Edgar 2010, 2013; Caporaso et al. 2010a; Huang et al. 2010; Huse et al. 2010; Kumar et al. 2011; Nilsson et al. 2011). Since clusters and consensus or seed sequences are based on overall similarity derived from pairwise alignments, testing the quality and consistency of individual reads becomes a challenge, because potential errors are masked. One of the most problematic issues is the overestimation and misinterpretation of biological units present in a sample, be it taxa or gene copies (Quince et al. 2009; Engelbrektson et al. 2010; Kunin et al. 2010; Porter and Golding 2011; Powell et al. 2011; Unterseher et al. 2011; Zhou et al. 2011). Improved filtering algorithms have reduced the number of errors, but they cannot guarantee that sequence reads are error-free, and a large number of reads already generated and deposited in public databases such as the Sequence Read Archive (SRA) have not undergone rigorous error testing.

Various papers have reported and analyzed potential errors generated by 454 pyrosequencing, including duplication, chimaeras, substitution errors caused by over- and under-calls, and length variation of individual reads due to indels not present in corresponding Sanger sequences (Margulies et al. 2005; Huse et al. 2007; Quince et al. 2008; Hoff 2009; Gomez-Alvarez et al. 2009; Kunin et al. 2010; Niu et al. 2010; Tedersoo et al. 2010; Balzer et al. 2011). To filter potential sequencing errors or erroneous reads, flow values from the 454 pyrosequencing process are transformed to quality scores and used to detect low-quality reads prior to data analysis or to weight reads in clustering approaches (Huse et al. 2010). Other denoising methods use rank-abundance distribution of reads or a likelihood approach (Reeder and Knight 2010; Quince et al. 2011). Studies using duplicated 454 pyrosequencing runs from the same samples found no significant differences in the allocation of sequences to taxonomic units, suggesting that at least part of the detected variation is biological (Kauserud et al. 2012; Kumar et al. 2011). This was also supported by cloning techniques (Tedersoo et al. 2010; Kauserud et al. 2012), suggesting that 454 pyrosequencing errors are low relative to biological variation. However, the assumption that length variation in non-singleton clusters is biological, while in singletons it is artifactual (Kunin et al. 2010; Tedersoo et al. 2010; Unterseher et al. 2011), is ill-defined, since clustering methods are unable to distinguish between biological and artifactual sources of length variation, and hence cluster sizes cannot be used to categorize such variation. 
If the frequency of sequence reads representing separate taxa approximately reflects the abundance of the taxa themselves, with variation depending on PCR settings, it would in fact be natural to assume that a large number of biological taxa in an NGS dataset are represented by singletons only, a notion pointed out by Unterseher et al. (2011).

A major problem of 454 pyrosequencing is the accurate detection of homopolymer segments. Since these are sequenced in a single step, the only way to determine the number of bases per segment is by measuring the chemiluminescent signal strength; the stronger the signal, the more bases were added in a given step. Signal strength is thereby calibrated against an initial key of four bases. In order to produce a signal that correctly represents the homopolymer length at a given position, the fragments present in a well must be completely extended, which depends on the number of DNA templates in the well, homopolymer length, and the amount of fluorescent base molecules flooded during each cycle. In reality, the extension follows a Poisson distribution for a given well; most fragments will be fully extended, whereas a small portion will only be partially extended. This leads to a suboptimal light signal for the well, which will then be interpreted as a homopolymer of shorter length (Margulies et al. 2005; Huse et al. 2007; Gomez-Alvarez et al. 2009; Kunin et al. 2010; Niu et al. 2010; Tedersoo et al. 2010; Balzer et al. 2011). In addition, the DNA templates in the well that are not fully complemented become desynchronized and will be completed during the next cycle with the same flow base. Depending on calibration, this additional signal may disappear as background noise or the sequences are filtered out, since the additional signal is below the calibrated threshold. However, if the proportion of desynchronized templates is sufficiently large, the well can produce an additional signal interpreted as an extra base call. Since flow cycles of the same base are separated by other base flow cycles, this extra base call appears several bases after its corresponding homopolymer and hence is not detectable as an error. 
This is known as carry-forward-incomplete-extension (CaFIE) error; it is rarely mentioned or accounted for correctly in 454 studies and usually not considered a significant source of error, and sometimes even assumed to be automatically corrected by the analytical software filtering the reads (454 Life Sciences Corporation 2009; Quince et al. 2009, 2011; Balzer et al. 2011; Gilles et al. 2011; Vandenbroucke et al. 2011; Keegan et al. 2012). Two recent studies proposed automated pipelines including corrections for CaFIE errors (RC454: Henn et al. 2012; ICC: Deng et al. 2013); however, upon testing, these either did not work properly (RC454) or did not provide a full set of corrected reads as results (ICC). Also, both methods rely on quality scores obtained from the sequencing process to perform corrections, although CaFIE errors technically cannot be detected using quality scores.
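The flow logic described above can be illustrated with a minimal Python sketch that converts a base sequence into an idealized, noise-free flowgram. The flow order TACG, the function name, and the integer signals are assumptions made purely for illustration; real instruments report continuous light intensities, and this sketch does not model incomplete extension itself.

```python
def ideal_flowgram(sequence, flow_order="TACG"):
    """Return (flow base, homopolymer length) pairs for a noise-free run.

    Each flow offers a single base; a whole homopolymer run is incorporated
    in that one flow, so the ideal signal equals the run length.
    The flow order is an assumption chosen for illustration.
    """
    if any(base not in flow_order for base in sequence):
        raise ValueError("sequence contains bases outside the flow order")
    flows = []
    pos = 0   # next base of the sequence still to be incorporated
    flow = 0  # index of the current flow
    while pos < len(sequence):
        base = flow_order[flow % len(flow_order)]
        run = 0
        while pos < len(sequence) and sequence[pos] == base:
            run += 1
            pos += 1
        flows.append((base, run))
        flow += 1
    return flows


for base, signal in ideal_flowgram("TTTACCA"):
    print(base, signal)
```

In this toy sequence, the second A can only be incorporated one full flow cycle after the first A flow. Under these assumptions, a homopolymer base that is incorporated late would likewise surface in a later flow of the same base, that is, several positions downstream of its homopolymer.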

As part of a 454 study using the ITS1 region of the fungal ITS barcoding locus to detect fungal inhabitants and contamination in thalli of the lichenized basidiomycete genus Dictyonema and its allies (Lawrey et al. 2009; Dal-Forno et al. 2013), we obtained a large set of target mycobiont reads from three samples belonging to a single species, Cora inversa, but with a large amount of variation of initially undetermined origin. Clustering techniques suggested this variation to represent a large number of biological entities, whereas multiple-alignment-based phylogenetic methods using a maximum likelihood approach clustered them into a single species. To resolve this conflict, we analyzed the same samples repeatedly under different conditions, to discriminate between biological and artifactual variation and to quantify the proportion of each within the studied samples.

Materials and Methods

Both Sanger and 454 sequence data were obtained from three samples of C. inversa from Colombia (Table 1), as part of a larger set of samples run together in two sequencing runs, MTPS25 and MTPS57. The samples were targeted using fungal-specific primers for the entire ITS (Sanger) and the ITS1 region (454), the latter producing reads with a target length of 325 bases for this taxon. The ITS was selected since it is the fungal barcoding locus (Hebert et al. 2003; Schoch et al. 2012) and since we used it successfully to address genus and species delimitation in the Dictyonema clade (Lawrey et al. 2009; Dal-Forno et al. 2013).

Table 1 GenBank, SRA, and BioProject accession numbers for sequences used in this study. The SRA accession number for the entire dataset is SRP033721

Genomic DNA was extracted from lichenized thalli using the Bio 101 Fast DNA Spin Kit for tissue (Qbiogene, Illkirch, France). About 10 ng of extracted DNA was subjected to a standard PCR in a 25 μL reaction volume using either Taq Gold polymerase (Applied Biosystems, Foster City, California, USA) or Bio-X-Act Long Mix (Bioline USA, Taunton, Massachusetts, USA). Sequence data were obtained from the ITS using the universal ITS primers ITS5 (forward) and ITS4 (reverse) for Sanger sequencing (ITS1, 5.8S, ITS2), and the primers ITS1F (forward) and ITS2 (reverse) (ITS1 region) for 454 pyrosequencing (http://www.biology.duke.edu/fungi/mycolab/primers.htm). After visualizing the PCR products with ethidium bromide on a 1 % agarose gel, the products were purified with magnetic beads (Agencourt Biosciences, Beverly, MA, USA) and used in standard sequencing reactions for Sanger sequencing with BigDye Terminator Ready Reaction Mix v3.1 (Applied Biosystems). The sequencing reactions were then purified using Sephadex G-50 (Sigma-Aldrich, St. Louis, MO, USA), dried in a speedvac, denatured in HiDi Formamide (Applied Biosystems) and run on an ABI3130-xl capillary sequencer (Applied Biosystems). The data collected were analyzed using ABI Sequence Analysis software and sequences were assembled with Sequencher v4.7 software (GeneCodes, Ann Arbor, MI, USA). Sanger ITS sequences were aligned with sequences of other species of Cora obtained from GenBank (same dataset as in Dal-Forno et al. 2013).

For 454 pyrosequencing, the samples were initially amplified with the ITS primers ITS1F (forward) and ITS2 (reverse) attached to a FAM label, allowing us to fingerprint the products to confirm the size and distribution of the community (Sikaroodi and Gillevet 2012). The forward and reverse primers had different custom tags and an A and B adapter, respectively (titanium adapters from the 454 pyrosequencing protocol), making it possible to pool different samples into one emulsion PCR. After confirming the PCR products on a 1 % agarose gel with ethidium bromide and fingerprinting them, the products were pooled and subjected to an emulsion PCR using the 454 emPCR_LibL kit, with a varying number of cycles (see below). Pyrosequencing was done on the recovered beads from the emulsion PCR using 454 sequencing kits. The sequence results were separated into bins for different tags using a custom-made Perl script (Sikaroodi and Gillevet 2012).
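The binning of pooled reads by sample tag can be sketched as follows. This is not the Perl script cited above; it is a minimal Python illustration that assumes exact matching of a tag at the start of each read. The function name and the tag sequences in the usage example are hypothetical.

```python
from collections import defaultdict


def bin_reads_by_tag(reads, tags):
    """Assign each read to the sample whose tag it starts with.

    reads: dict mapping read id -> sequence
    tags:  dict mapping sample name -> tag sequence (hypothetical tags)
    Returns a dict mapping sample name -> list of (read id, sequence with
    the tag removed). Reads matching no tag are placed in "unassigned".
    Assumes exact matches and that no tag is a prefix of another tag.
    """
    bins = defaultdict(list)
    for read_id, seq in reads.items():
        for sample, tag in tags.items():
            if seq.startswith(tag):
                bins[sample].append((read_id, seq[len(tag):]))
                break
        else:
            # no tag matched this read
            bins["unassigned"].append((read_id, seq))
    return dict(bins)


# Hypothetical tags and reads, for illustration only
example_tags = {"sample_1": "ACGAGTGC", "sample_2": "ACGCTCGA"}
example_reads = {"r1": "ACGAGTGCTTGG", "r2": "ACGCTCGAAACC", "r3": "GGGGTTTT"}
print(bin_reads_by_tag(example_reads, example_tags))
```

A production script would typically also tolerate sequencing errors within the tag; that is deliberately left out here.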

Three different combinations of settings and conditions were used for each repeat emulsion PCR and 454 pyrosequencing run, all based on the same community PCR and hence on the same biological templates for the three samples of C. inversa: (1) initial emulsion PCR plus initial 454 pyrosequencing settings (repeat 1 in sequencing run MTPS25); (2) initial emulsion PCR plus modified 454 pyrosequencing settings (repeat 2A in sequencing run MTPS57); (3) de-novo emulsion PCR plus the same modified 454 pyrosequencing settings (repeat 2B in the same sequencing run MTPS57). The modified settings mainly concerned optimization between repeats 1 and 2A/B. Since tests showed that the suggested standard 50 cycles for emulsion PCR resulted in fewer reads passing the quality filters and hence a lower yield and a higher proportion of sequencing errors, we modified the settings to 45 cycles of emulsion PCR for repeats 1 and 2A (initial emulsion PCR) versus 40 cycles for repeat 2B (de-novo emulsion PCR). As a result, repeats 1 and 2A produced approximately 5 % bead recovery, with all beads loaded on the sequencing plate, whereas repeat 2B resulted in approximately 10 % bead recovery, half of which were loaded on the sequencing plate. In addition, between sequencing runs 1 (MTPS25) and 2A/B (MTPS57), the valves of the GS Junior were replaced. Since yield and proportional read quality improved consistently between repeats 1 and 2A/B (from less than 40 % to over 70 % yield), the overall quality of the sequencing results was categorized as moderate (repeat 1) and good (2A/B). As a consequence, substantial differences in the proportional frequency of read variation patterns (e.g. the same indel at the same position) detected between repeats 1 and 2A would have to be attributed to systematic sequencing errors, whereas differences between repeats 2A and 2B would be interpreted as caused by variation in the emulsion PCR settings. 
In contrast, patterns of real biological variation would result in comparable proportional frequencies across repeats or (slightly) increased frequency in repeats 2A/B, since it can be assumed that repeats 2A/B would yield a higher proportion of accurate reads compared to repeat 1.

Since 454 pyrosequencing amplifies and sequences all fungi present in the samples, the resulting reads were subjected to a filtering process to select only those ITS1 reads corresponding to C. inversa. The reads were "pre-digested" by merging all reads from the three samples into a single fasta-formatted file. This dataset was analyzed using a combined USearch-Perl-Windows script in batch format (Online Appendix A). The script uses USearch (Edgar 2010, 2013) and the Perl script fasta-to-tab-delimited format (and vice versa) provided by the FAS Center for Systems Biology of Harvard University (http://sysbio.harvard.edu/csb/resources/computational/scriptome/Windows/Tools/Change.html), as well as the Windows findstr command. It arranges all reads according to length and then clusters the reads at a prespecified ID level, producing a combined fasta-formatted file containing the reference read (the longest read) for each cluster. We used 10.0, 0.1, 1.0, and 0.1 as internal and external gap opening and gap extension penalties (Edgar 2010), to minimize the formation of separate clusters based on read length rather than base sequence patterns. The ID level was set to 0.90 to minimize splitting of closely related taxa into separate clusters, while at the same time making the clusters consistent enough to allow for de-novo alignment of the contained reads. The script generates a modified fasta file of all reads, sorted by their cluster assignments from the previous step, which is aligned using the quick alignment algorithm provided by USearch StarAlign (Edgar 2010) and then converted to a tab-delimited text file, and the contained clusters are parsed out into separate fasta files. 
In addition to parsing the entire dataset into separate clusters at ID = 0.90, we also analyzed cluster formation of the dataset at sequential ID levels (0.50, 0.60, 0.70, 0.80, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99) to obtain the number of potential OTUs obtained at each level.
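The general idea of length-sorted, threshold-based clustering at sequential ID levels can be sketched in Python as follows. This is not the USearch algorithm or the script of Online Appendix A. It is a simplified greedy scheme in which the similarity measure (difflib's ratio) is a crude stand-in for an alignment-based identity with gap penalties, and all function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def identity(a, b):
    """Crude similarity in [0, 1]; a stand-in for alignment-based identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def greedy_cluster(reads, threshold):
    """Cluster reads greedily, longest first.

    The first (longest) read of each cluster acts as its reference read.
    Returns a dict mapping reference read -> list of member reads.
    """
    clusters = {}
    for read in sorted(reads, key=len, reverse=True):
        for seed in clusters:
            if identity(seed, read) >= threshold:
                clusters[seed].append(read)
                break
        else:
            # no existing reference read is similar enough: start a new cluster
            clusters[read] = [read]
    return clusters


def otu_counts(reads, thresholds):
    """Number of clusters (potential OTUs) obtained at each ID level."""
    return {t: len(greedy_cluster(reads, t)) for t in thresholds}


# Toy reads, for illustration only
toy_reads = ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGTAA", "TTTTTTTTTT"]
print(otu_counts(toy_reads, [0.50, 0.85, 0.95]))
```

Even in this toy case, raising the threshold splits a read with a single deviating base into its own cluster, which is the behavior examined for the full dataset.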

To provide taxonomic identifications of the reads and extract those representing C. inversa, each of the clusters obtained at the ID = 0.90 level was batch-blasted (i.e. with blast results for each individual read) against a curated database combined from IAB and UNITE (Kõljalg et al. 2005, 2013; Nilsson et al. 2005, 2006, 2008, 2009; Abe et al. 2011) plus a Sanger sequence database obtained from a large number of Dictyonema s.lat. samples including C. inversa. The clusters were also batch-blasted against GenBank sequences using a threefold approach: (1) including, (2) excluding uncultured/environmental sample sequences, and (3) if the best hit from (1) was an uncultured/environmental sequence that sequence was blasted again excluding uncultured/environmental sample sequences. Thus, we obtained four separate blast results for each read and its taxonomic identity was determined based on consistency of these results. The reads were then separated into (A) those blasting with C. inversa and other species of Dictyonema, (B) those blasting with closely related Basidiomycota, and (C) those blasting with other Basidiomycota and Ascomycota or lower fungi. While the reads of group (C) were not further considered, those of group (B) were subjected to a more specific local blast including only Dictyonema and closely related Basidiomycota (Agaricales) as reference sequences, to detect further reads representing Dictyonema which were then added to group (A).

All reads of group (A) were then aligned to a Sanger ITS reference alignment representing all sequenced species of Dictyonema s.lat. (including Cora; Dal-Forno et al. 2013) using PaPaRa 2.0 (Berger and Stamatakis 2011; Berger et al. 2011). The alignment was analyzed phylogenetically with a maximum likelihood approach using the RAxML-HPC BlackBox 7.3.2 on the Cipres Gateway server (Stamatakis 2006; Stamatakis et al. 2005, 2008; Miller et al. 2010), employing a GTR-Gamma model, with parametric bootstrapping using an automatically determined number of replicates as implemented by the RAxML-HPC BlackBox by means of a saturation assessment. A total of 16,665 reads was found to cluster with the Sanger reference sequences of C. inversa and these were retained for further study.

The retained 454 ITS1 reads of C. inversa were realigned with a Sanger ITS reference alignment of the same species to determine indel sites. MAFFT (Katoh et al. 2002, 2009) was unable to accurately align the length variable ITS1 reads, failing to recognize all homologous base and indel positions. Instead, accurate results were achieved with PaPaRa 2.0 (Berger and Stamatakis 2011), which aligns length variable 454 reads to a known Sanger reference alignment, introducing gaps into the reference alignment where the 454 reads contain additional base calls. For this purpose, PaPaRa uses a probabilistic gap model based on the fixed alignment and the best-scoring maximum-likelihood tree derived via RAxML (Stamatakis 2006) from that alignment. This method produces accurate alignments superior to those generated with MAFFT, and the position of insertions is easily discernible. In addition, PaPaRa allows for the generation of insertion-free alignments of the 454 reads by optionally deleting all insertion sites not present in the Sanger reference sequences. PaPaRa was also used to compute read lengths with and without insertions for each 454 read, and to calculate the relative proportion of insertions across all reads.
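The bookkeeping of read lengths with and without insertions can be illustrated with a short Python sketch. It does not reproduce PaPaRa; it only assumes that a gapped reference row and a gapped read row of equal length are available from a reference-anchored alignment, and that a read base in a column where the reference has a gap counts as an insertion. The function name and the gap character are assumptions.

```python
def insertion_stats(ref_row, read_row, gap="-"):
    """Summarize insertions of one aligned read relative to the reference.

    Returns (length with insertions, length without insertions,
    number of insertions, proportion of insertions relative to the
    total read length including insertions).
    """
    if len(ref_row) != len(read_row):
        raise ValueError("rows must have equal alignment length")
    length_with = sum(1 for b in read_row if b != gap)
    insertions = sum(
        1 for r, b in zip(ref_row, read_row) if r == gap and b != gap
    )
    length_without = length_with - insertions
    proportion = insertions / length_with if length_with else 0.0
    return length_with, length_without, insertions, proportion


# Toy alignment rows, for illustration only
print(insertion_stats("ACG-TTA-C", "ACGATT-GC"))
```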

The "Positional Nucleotide Numerical Summary" function in BioEdit 7.0.9 (Hall 1999) was used to compute the base composition at each alignment position to determine the nature of each of the indels across all the reads overall, separated by sample and by run. Indels were classified into two major site categories, homopolymer extensions and heteropolymer insertions. The latter were further analyzed for site-specific variation among reads.
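A per-column summary of this kind can be sketched in Python as follows. The decision rule is an assumption made for illustration and is not taken from BioEdit or from the original analysis: an inserted base identical to the nearest reference base on either side is treated as a homopolymer extension, any other inserted base as a heteropolymer insertion, columns with both kinds as mixed, and columns with insertions in fewer than three reads as random.

```python
from collections import Counter


def classify_insertion_column(ref_row, read_rows, col, gap="-", min_reads=3):
    """Classify one insertion column (reference gap) by the reads' base calls.

    Returns (label, Counter of base calls at that column).
    The classification rule is an illustrative assumption.
    """
    if ref_row[col] != gap:
        raise ValueError("column is not an insertion site")
    # nearest non-gap reference bases on either side of the column
    left = next((b for b in reversed(ref_row[:col]) if b != gap), None)
    right = next((b for b in ref_row[col + 1:] if b != gap), None)
    calls = Counter(row[col] for row in read_rows if row[col] != gap)
    if sum(calls.values()) < min_reads:
        return "random", calls
    kinds = {
        "homopolymer" if base in (left, right) else "heteropolymer"
        for base in calls
    }
    label = "mixed" if len(kinds) > 1 else kinds.pop()
    return label, calls


# Toy reference row with one insertion column at index 4
ref = "AAAC-GT"
print(classify_insertion_column(ref, ["AAACAGT"] * 3, 4))
```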

In order to test the effect of indels in the 454 ITS reads on the estimation of taxonomic diversity by means of automated clustering methods, we used another combined USearch-Perl-Windows script (Online Appendix B), which automatically computes the number of taxonomic units based on a predefined threshold similarity level by parsing out the seed sequences for each assumed taxonomic unit into a fasta-formatted file. The script was applied to the complete set of reads corresponding to C. inversa and the following threshold similarity levels were applied: 0.50, 0.60, 0.70, 0.80, 0.90, 0.91, 0.92, 0.93, 0.94, 0.95, 0.96, 0.97, 0.98, 0.99, 1.00. We also randomly selected different numbers of reads (16, 32, 64, 125, 250, 500, 1000, 2000, 4000, 8000, 16000) and ran the script for each set to compute the relationship between read frequency and estimate of biological diversity.
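The random subsampling step can be sketched in Python as shown below. This is not the script of Online Appendix B. The function name is an assumption, and `count_otus` is a hypothetical callable standing in for whatever clustering routine returns the number of taxonomic units for a set of reads. The default of ten repeats per level follows the design shown in Fig. 2.

```python
import random


def diversity_by_depth(reads, depths, count_otus, repeats=10, seed=0):
    """Estimate the number of taxonomic units for random subsets of reads.

    reads:      list of read sequences
    depths:     numbers of reads to draw (levels larger than the dataset
                are skipped)
    count_otus: hypothetical callable mapping a list of reads to a count
    Returns a dict mapping depth -> list of estimates, one per repeat.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    results = {}
    for depth in depths:
        if depth > len(reads):
            continue
        results[depth] = [
            count_otus(rng.sample(reads, depth)) for _ in range(repeats)
        ]
    return results


# Toy usage: counting distinct sequences as a stand-in for clustering
toy_reads = ["ACGT"] * 90 + ["ACGA"] * 10
print(diversity_by_depth(toy_reads, [16, 32, 64], lambda r: len(set(r))))
```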

Results

The total yield of the two sequencing runs MTPS25 and MTPS57 was 41,465 and 84,368 reads, respectively, for the samples corresponding to Cora and Dictyonema. After separation of reads corresponding to other species of (endolichenic) fungi or contaminants, and separating the remaining reads according to genera and species, the number of ITS1 reads for the target mycobiont of the three samples of C. inversa and three repeats (1, 2A, 2B) was 16,665, with a range of 1,269–2,465 reads per sample and repeat (Table 2). The largest number of reads was generated for sample DIC154 and repeat 2B; for all samples, repeats 2A/B had a substantially higher yield than repeat 1. The target length for the ITS1 reads according to Sanger sequences from the same samples was 325 bases. The 454 ITS1 reads varied in length from 100 bases (predetermined lower length cut-off) to 349 bases (Fig. 1), with the 22 longest reads found in repeat 1 of samples DIC150 and DIC154. The majority of the reads (88 %) was complete to near-complete, with a length of 300–349 bases; only 5.7 % of the reads had fewer than 200 bases (Fig. 1).

Table 2 Number of 454 ITS1 reads obtained for each sample of C. inversa in each of the runs
Fig. 1 Histogram of read lengths of the 16,665 454 ITS1 reads corresponding to C. inversa generated from the three samples and three runs

The number of insertion sites per read varied from 0 to 25, with most of the larger values found in reads of repeat 1 of samples DIC150 and DIC154; however, the largest value was found in a read of repeat 1 of sample DIC149 (Table 3). On average, reads from repeats 2A/B had a significantly lower number of insertions (0.20–0.44 per sample) than those from repeat 1 (2.30–3.33 per sample; Kruskal–Wallis ANOVA, p < 0.001). Relative to read length, this difference was even more pronounced, with 0.07–0.15 % (range 0–5.37 %) in reads of repeats 2A/B, and 0.84–1.23 % (range 0–14.4 %) in reads of repeat 1 (Table 4). While some reads had up to 14.4 % of their bases corresponding to insertions, the large majority of reads even in repeat 1 had less than 1 % of insertion sites, corresponding to about 2–3 bases per read.

Table 3 Number of insertions (mean and range) per read obtained for each sample of C. inversa in each of the runs
Table 4 Relative proportion of insertions (mean and range) per read (relative to total read length including insertions) obtained for each sample of C. inversa in each of the runs

The total length of the entire PaPaRa alignment of all 16,665 454 ITS1 reads plus nine corresponding Sanger reference sequences was 591 bases, including 325 sites corresponding to the Sanger reference sequences and 266 insertion sites obtained from the 454 ITS1 reads (Online Appendix C). Thus, 45 % of the PaPaRa alignment represented insertion sites. Of the 591 sites, positions 1–15 corresponded to the nuSSU primer and positions 505–591 to the highly conserved 5.8S region including the 5.8S primer. The insertion sites could be classified into three categories: (1) homopolymer extensions; (2) heteropolymer insertions; and (3) mixed sites which, depending on the base call in individual reads, can be interpreted as either homopolymer extensions or heteropolymer insertions (Online Appendix D). A total of 115 positions, five corresponding to the ITS5 forward primer and ten to the 5.8S region, were classified as homopolymer extensions; 47 sites represented heteropolymer insertions (one in the primer and seven in the 5.8S region); 52 sites were mixed sites which could not be unambiguously classified as either homopolymer extensions or heteropolymer insertions (one in the ITS5 forward primer and seven in the 5.8S region); and 53 positions had insertions in no more than two reads and were classified as random (one in the ITS5 forward primer and 15 in the 5.8S region). Since homopolymer extensions can be explained by incorrectly calibrated chemiluminescent sensor sensitivity (base overcall or undercall), we focused further analysis on heteropolymer insertions. Of the 47 heteropolymer insertion sites, 31 were present in ten or more reads, providing a sound basis for statistical analysis (Table 5). A further 13 were present in three or more (but fewer than ten) reads. 
All of these 44 insertion sites had either a single, uniform base call in all reads across all three samples and repeats (31 sites) or a strongly dominating base call with a single exception in no more than one or two reads (13 sites; Online Appendix D). This implies that the nature of the insertion was practically the same in all reads across all samples and repeats for a given site, independent of the emulsion PCR and sequencing parameters, as well as independent of the well position corresponding to individual reads. This pattern was found to be highly significantly different from random (Chi Square test; p < 0.001).

Table 5 Summary of 31 heteropolymer insertion sites present in ten or more reads, sorted in descending order by number of reads in which insertion was present

Heteropolymer insertions were most frequently found in repeat 1 (all three samples), with sample DIC150 having the highest proportion of reads with insertions present (Table 5). In most cases, the proportion of reads with insertions was significantly lower in repeats 2A/B, although the same insertions were found in these reads across all samples. In some cases, insertions were about equally distributed across reads from all samples and repeats (positions 174, 180, 259, 389, 424, 455), and in a few others, they were more frequent in reads from repeats 2A/B (positions 264, 297, 515, 547). Overall, we found 15,654 insertions in the 16,665 reads (0.94 per read on average), 8,066 of which corresponded to homopolymer extensions, and 3,437 to heteropolymer insertions; 4,053 corresponded to mixed sites (both homopolymer extensions and heteropolymer insertions) and 106 to random sites. Of the 3,437 heteropolymer insertions, 3,368 (98 %) corresponded to the 31 insertion sites present in ten or more reads. Without exception, all of these occurred 1–3 bases after a homopolymer segment of the same base call, corresponding to partial deletions in the homopolymer segment and hence representing CaFIE errors. The remaining heteropolymer insertions, as well as those in mixed sites, showed the same patterns, even in those cases where insertions were equal in frequency between repeats or more frequent in repeats 2A/B (total of 263 out of 3,368). Thus, of the total of 15,654 insertions, up to 15,548 (99.3 %) could be attributed to either homopolymer extension or CaFIE errors.
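The positional pattern described above (an inserted base following, at a short distance, a homopolymer segment of the same base) can be expressed as a simple check. The Python sketch below is an illustration, not the procedure used in this study. It interprets "1–3 bases after" as one to three intervening bases of a different kind, and the minimum homopolymer length of two retained in the read is an additional assumption. Both are exposed as parameters.

```python
def is_cafie_candidate(read, ins_pos, max_gap=3, min_run=2):
    """Flag an inserted base as a possible CaFIE artifact.

    read:    ungapped read sequence
    ins_pos: index of the inserted base within the read
    Returns True if a homopolymer of the same base (at least min_run long)
    ends 1..max_gap bases upstream of the insertion, with only other bases
    in between. Thresholds are illustrative assumptions.
    """
    base = read[ins_pos]
    for gap in range(1, max_gap + 1):
        end = ins_pos - gap - 1  # last index of the putative homopolymer
        if end < 0:
            break
        if base in read[end + 1:ins_pos]:
            # the same base occurs in between: not a separated insertion
            continue
        if read[end] != base:
            continue
        run = 0
        j = end
        while j >= 0 and read[j] == base:
            run += 1
            j -= 1
        if run >= min_run:
            return True
    return False


# Toy reads: the A at index 7 follows "AAA" with two other bases in between
print(is_cafie_candidate("CCAAAGTAC", 7))
print(is_cafie_candidate("CCGGGTTAC", 7))
```

A check of this kind can only nominate candidates; as argued in the text, distinguishing them from biological indels requires comparing frequencies across repeated runs.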

In addition to insertions, we also detected 3,279 substitutions in 187 sites aligned to those present in the Sanger reference sequences, corresponding to 0.06 % of all base calls. The number of reads with substitutions ranged from 1 to 806, with 184 sites having 31 or fewer reads with substitutions (less than 0.2 % of all reads) and three sites having a larger number of reads with substitutions (124, 129, 806). A total of 118 substitution sites had 2,839 substitutions. These were highly consistent across all reads for each position and apparently not artifacts, since 2,677 substitutions (94 %) exhibited exactly the same base call independent of read and run. Inspection of individual reads showed that they might in part represent masked CaFIE errors, incorrectly aligned with non-homologous sites in the Sanger reference sequences due to adjacent deletions; however, we found no significant differences in relative frequency of these substitutions among repeats, suggesting that they might also represent real biological substitutions. For example, the site with the largest number of reads with substitutions, 806 out of 16,665 (4.8 %) at position 125, invariably had G in the Sanger reference sequences and A in 805 out of 806 reads with substitutions: 224 out of 4,278 reads in repeat 1 (5.2 %), 301 out of 6,091 reads in repeat 2A (4.9 %), and 280 out of 6,296 reads in repeat 2B (4.4 %). The next two positions had far fewer reads with substitutions (129 and 124, respectively, or 0.74–0.77 %), but with similarly consistent substitution patterns, replacing G with A (position 193) and A with T (position 256).

The number of OTUs present in the dataset of 16,665 reads was estimated between 3 and 5,945 depending on the threshold level, with 37 at 95 % (Table 6). After correcting the reads using PaPaRa (insertions automatically removed), the corresponding estimates were between 1 and 3,425, with 4 at 95 % (Table 6). Since phylogenetic analysis demonstrated that all reads belonged to C. inversa, the application of the cluster analysis on the original reads resulted in an 11-fold, 35-fold, and 137-fold overestimation of biological diversity using the commonly applied threshold ID levels of 90, 95, and 97 %. For the corrected reads (insertions removed but substitutions retained), biological diversity was estimated correctly up to the 80 % ID level, whereas at the 90, 95, and 97 % ID levels, overestimation by factors of 2, 4, and 8 occurred (due to the presence of substitutions). This represented a 5–16-fold improvement compared to uncorrected reads.

Table 6 Number of assumed biological units derived from the complete set of 454 ITS1 reads of C. inversa resulting from the three samples and runs, using the original reads and reads after removal of insertions with PaPaRa 2.0 (Berger and Stamatakis 2011)

There was a highly significant linear correlation between read frequency and estimate of biological diversity at all threshold levels, with R = 0.86 at 90 %, R = 0.94 at 95 %, and R = 0.96 at 97–100 % ID threshold level (Table 7). For up to 16 reads, threshold levels of up to 97 % predicted the biological diversity correctly, but this value dropped substantially with higher read frequency (Fig. 2). Using 95 % threshold, 32–125 reads overestimated the biological diversity twofold, and at 90 %, more than 1,000 reads overestimated the biological diversity at least threefold. At 1,000 reads, only a threshold level of 80 % or lower predicted the biological diversity correctly.
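For readers wishing to reproduce a rank correlation of this kind (Table 7 reports Spearman rank correlations), a minimal pure-Python sketch is given below. It computes Spearman's coefficient as the Pearson correlation of average ranks, which handles ties. The function names are assumptions, and the numbers in the usage example are invented for illustration and are not data from this study.

```python
def average_ranks(values):
    """Rank values from 1..n, assigning tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the block of values tied with values[order[i]]
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation as the Pearson correlation of ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented example values, not data from this study
read_numbers = [16, 32, 64, 125, 250]
unit_estimates = [1, 2, 2, 3, 5]
print(round(spearman(read_numbers, unit_estimates), 3))
```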

Table 7 Spearman rank correlation between read frequency and estimate of biological diversity (see Fig. 2)

Fig. 2 Read frequency (randomly selected out of a total of 16,665 reads, with ten repeats per frequency level) and resulting estimate of biological diversity. Colors indicate level of correctness compared to real biological diversity (a single species) (Color figure online)

Discussion

Our finding of a substantial amount of variation in 454 pyrosequencing data of the lichenized fungus C. inversa at first glance appears in line with an increasing number of NGS studies reporting intragenomic variation in the ITS and other portions of the nuclear rDNA cistron (Lindner et al. 2013). However, in contrast to other studies, we found no evidence for underlying biological causes for most of this variation. Instead, while potentially real biological substitutions accounted for only 0.06 % of all corresponding base calls, we attribute more than 99 % of the observed insertions to either homopolymer extension or CaFIE errors (Margulies et al. 2005; Rothberg and Leamon 2008; 454 Life Sciences Corporation 2009; Balzer et al. 2011; Gilles et al. 2011; Vandenbroucke et al. 2011).

Technically, CaFIE errors cannot be distinguished from biological variation by inspecting individual reads, since both represent systematic positional variation, with the same insertion consistently appearing at the same position in different reads. CaFIE errors thus mimic indel variation found in Sanger sequences between closely related species (Dal-Forno et al. 2013). Automated denoising methods (Quince et al. 2009, 2011; Caporaso et al. 2010a; Reeder and Knight 2010) cannot adequately detect these errors, since they focus on signal flow variation of homopolymer segments, but do not take into account associated heteropolymer insertions. Another problem is that CaFIE errors might be masked as substitution errors if reads that contain both insertions and deletions are not properly aligned; this might explain why some studies on sequence errors report very low proportions of indels versus substitutions (Keegan et al. 2012). The difficulty of assessing CaFIE errors might be the reason why previous authors considered them of little significance compared to other potential error sources (Margulies et al. 2005; Rothberg and Leamon 2008; 454 Life Sciences Corporation 2009; Balzer et al. 2011; Gilles et al. 2011; Vandenbroucke et al. 2011). While CaFIE errors can theoretically be predicted by comparing heteropolymer sequence patterns immediately following homopolymer segments, the only way of actually detecting such errors is the comparison of separate sequencing runs of the same community PCR samples, which permits assessing frequency patterns: if the variation is largely biological, one would expect comparable proportions of positional variation in separate runs, as was observed here for substitutions and a small proportion of insertions. In addition, it is necessary to analyze a larger number of reads from a given sample and species and compare reads with Sanger sequences from the same samples, to accurately recognize erroneous insertion sites.

Previous studies using duplicated 454 pyrosequencing runs from the same samples found no significant differences in the allocation of sequences to taxonomic units, suggesting that at least part of the detected variation was biological and not artifactual (Kauserud et al. 2012; Kumar et al. 2011). However, these authors attempted to replicate the same sequencing conditions and used a clustering approach to detect biological units, rather than analyzing and comparing individual reads by means of multiple fixed alignments. Sequencing of environmental samples (O’Brien et al. 2005; Sogin et al. 2006; Geml et al. 2008; Taylor et al. 2008; Buée et al. 2009; Amend et al. 2010; Lumini et al. 2010; Wallander et al. 2010; Hibbett et al. 2011; Dai et al. 2012; McGuire et al. 2012) usually results in a low number of redundant read copies for each taxon present in the sample, which makes the detection of CaFIE errors difficult. In addition, clustering techniques will interpret CaFIE errors as real variation, especially if they use the longest reads as seeds, such as USearch (Edgar 2010, 2013), since the longest reads will likely have the largest number of erroneous insertions.
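The effect of seeding clusters with the longest reads can be illustrated with a toy example. The sketch below is a simplified greedy clustering based on edit distance, written only for illustration; it is not the algorithm implemented in USearch, and the reference sequence, the inserted bases, and all function names are invented for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b))) # substitution
        previous = current
    return previous[-1]

def identity(a, b):
    """Pairwise identity, defined here as 1 - edit distance / longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def greedy_cluster(reads, threshold):
    """Count clusters when the longest reads are tried as seeds first."""
    seeds = []
    for read in sorted(reads, key=len, reverse=True):
        if not any(identity(read, seed) >= threshold for seed in seeds):
            seeds.append(read)
    return len(seeds)

# Hypothetical 20-base "true" sequence and two reads carrying 2-base insertions.
REFERENCE = "ACGTTGCAAGCTTAGGCATC"
read_a = REFERENCE[:5] + "GG" + REFERENCE[5:]
read_b = REFERENCE[:15] + "TT" + REFERENCE[15:]
reads = [REFERENCE] * 3 + [read_a, read_b]

print("clusters at 95 %, raw reads:", greedy_cluster(reads, 0.95))
print("clusters at 95 %, insertions removed:", greedy_cluster([REFERENCE] * 5, 0.95))
```

Because the two reads with insertions are the longest, they become seeds first, and the error-free reads then fall below the threshold relative to both of them. A single underlying sequence is thus split into several clusters until the insertions are removed.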

In the present case, automated clustering of reads representing a single species led to a 35-fold overestimation of biological diversity at the commonly applied threshold ID level of 95 %. This finding is alarming, since automated clustering will invariably interpret all variation present in a sample as biological. The problem is especially acute for studies of environmental samples, where species are not known and cannot be referenced against the existing Sanger sequences. Unfortunately, clustering techniques are indispensable to pre-digest large amounts of data (Li and Godzik 2006; Schloss et al. 2009; Caporaso et al. 2010a; Edgar 2010, 2013; Huang et al. 2010; Kumar et al. 2011), especially if the target gene is highly variable and unalignable across the range of analyzed organisms, such as ITS. The problem of overestimation of biological diversity using clustering methods has been discussed in several studies (Quince et al. 2009; Engelbrektson et al. 2010; Kunin et al. 2010; Powell et al. 2011; Unterseher et al. 2011; Zhou et al. 2011), but the error is usually believed not to exceed a factor of ten, and is often considered to be much lower, contrasting with the results presented here. The overestimation of biological diversity through clustering increases with read frequency per taxon; hence, species represented by few reads in a sample are likely to be correctly recognized, whereas species with a large number of reads tend to be interpreted as more than one taxon. Our analysis suggests that up to 16 reads per taxon will predict biological diversity correctly up to a relatively high threshold level of 97 %, but accuracy rapidly declines in linear fashion with increasing read frequency (cluster size) per taxon. Hence, this problem will have low impact in datasets with a rather balanced distribution of (few) reads per taxon, but potentially high impact in datasets which contain few to many dominant taxa with large numbers of reads each.
Without removal of CaFIE errors, the only way to accurately analyze such data is to employ phylogenetic approaches instead of clustering (Porter and Golding 2011; Powell et al. 2011), including alignment tools such as PyNAST and PaPaRa (Caporaso et al. 2010b; Berger and Stamatakis 2011). Our results also imply that, contrary to what has been suggested in other studies (Kunin et al. 2010; Tedersoo et al. 2010; Unterseher et al. 2011), singleton reads should not be excluded as potential artifacts but instead contribute substantially to accurate estimates of taxonomic diversity, since sequencing errors in singletons have no adverse effect on diversity estimates based on clustering methods. Thus, while previous studies suggested excluding singletons and focusing on clusters to estimate diversity, we conclude that singletons should by default be considered to represent real biological diversity, whereas clusters should be carefully examined for effects of sequencing errors by analyzing them with alignment-based phylogenetic methods.

The huge amount of NGS data to be dealt with poses a challenge to phylogenetic approaches, but this can be surmounted by using clustering and blasting techniques to pre-digest the data and then applying alignment-based phylogenetic analysis on portions of the data to determine biological units. Phylogenetic analysis using maximum likelihood can now be applied to alignments containing thousands of sequences by using RAxML on public multicore servers (Stamatakis 2006; Stamatakis et al. 2005, 2008; Miller et al. 2010). The alternative approach, removal of sequencing errors using PaPaRa prior to clustering, considerably improves the results of clustering compared to original reads; yet, clustering will always only provide a proxy for real diversity, since a predefined threshold ID level has to be applied to make data comparable, whereas a phylogenetic approach is independent of preset ID levels. Phylogenetic methods that provide simultaneous search options for the best alignment-tree combination, such as POY, BAli-Phy, and SATé (Suchard and Redelings 2006; Liu et al. 2009, 2012; Varón et al. 2010), are potentially valuable methods to analyze widely divergent sequences in a sample without the necessity of a multiple fixed alignment (Wang et al. 2011), but these methods are computationally expensive and currently cannot handle very large datasets of tens to hundreds of thousands of reads, such as those produced by typical 454 pyrosequencing runs.

The interpretation of most of the variation found here as 454 pyrosequencing errors, with only a small proportion of substitutions apparently reflecting biological variation, is in accordance with studies that reported intragenomic ITS as comparatively uniform (Morzycka-Wroblewska et al. 1985; Borsuk et al. 1988; Ganley and Scott 2002; Ganley and Kobayashi 2007). Such uniformity is assumed to be due either to concerted evolution (Liao 1999, 2008) or to a birth-and-death mechanism with purifying selection (Morzycka-Wroblewska et al. 1985; Borsuk et al. 1988; Ganley and Scott 2002; Rooney and Ward 2005; Ganley and Kobayashi 2011). Other studies suggest intragenomic variation in fungal ITS and the nuclear rDNA (Hughes and Petersen 2001; Ko and Jung 2002; Gomes et al. 2002; Pawlowska 2005, 2007, 2010; Rooney and Ward 2005; Smith et al. 2007; Simon and Weiss 2008; Nilsson et al. 2008; James et al. 2009; Lindner and Banik 2011; Kovács et al. 2011a, b; Li et al. 2013; Lindner et al. 2013). In a large-scale analysis of Sanger ITS sequences of the lichenized fungal family Physciaceae, Lücking et al. (2012) found evidence for rDNA (ITS) pseudogenes in a single clade, whereas all other clades were consistent with a single, functional rDNA cistron. Therefore, the lack of evidence for intragenomic variation in some studies and its presence in others suggests that pseudogene evolution is specific to individual lineages, even within the same, larger clades. Thus, the existence of pseudogenes will have to be analyzed for each clade separately, which poses a challenge for the use of ITS as the fungal barcoding locus (Hebert et al. 2003; Schoch et al. 2012), since pseudogenes can form novel clades that mimic cryptic speciation. Our method would be a viable approach to resolve this problem for any given clade.

Quantitative analysis of 454 pyrosequencing data for single clades or taxa can also shed light on the nature of potential variation and how such variation might affect phylogenetic studies using ITS generated from Sanger sequencing as the barcoding locus. In the present case, assuming that substitutions (not indels) detected in the dataset reflect real biological variation, their very low proportion (0.06 % of all base calls) will cause them to disappear as background signal in Sanger sequencing. This is supported by the Sanger sequences obtained from the same samples used in this study, which do not contain any variation. Sanger sequencing thus acts like a filter, which is an advantage when defining phylogeny-based taxa, but suggests genomic uniformity when in fact some level of intragenomic variation exists. The low level of variation found here would suggest that a small portion of the rDNA (ITS) copies in the genome of C. inversa "escaped" concerted evolution. However, under a model of concerted evolution or birth-and-death mechanism with purifying selection (Morzycka-Wroblewska et al. 1985; Borsuk et al. 1988; Liao 1999, 2008; Ganley and Scott 2002; Rooney and Ward 2005; Ganley and Kobayashi 2007; Ganley and Kobayashi 2011), one would expect a small proportion of the gene copies to deviate from a uniform pattern at any given time. Hence, small amounts of variation do not contradict, but are consistent with, concerted evolution.