Introduction

Fungi are important in soils as both decomposers and plant symbionts. Traditional surveys based on macroscopic or microscopic features, such as fruit body surveys, microscopy of plant roots or isolation techniques, have made considerable progress but remain insufficient to describe the fungal communities inhabiting soil environments. Molecular methods have recently overcome many of these limitations by allowing the detection of unculturable community members. Since its first applications in 2009 (Buée et al. 2009; Jumpponen and Jones 2009), amplicon pyrosequencing has been used to study the diversity of fungal communities (e.g. Buée et al. 2009; Jumpponen and Jones 2009; Öpik et al. 2009), the activity of fungal communities (Baldrian et al. 2012; Štursová et al. 2012) or functional genes (e.g. Baldrian et al. 2013; Voříšková and Baldrian 2013) of both total fungi and specific groups like the Glomeromycota (e.g. Dumbrell et al. 2011; Lekberg et al. 2012; Öpik et al. 2009). Pyrosequencing has become the method of choice for the in-depth analysis of fungal community composition.

Data accumulate with the increasing number of studies, but the experimental approaches to data collection and analysis differ widely. This greatly limits our ability to compare studies and to draw general conclusions on important questions such as estimating community diversity, evenness and composition or identifying important taxa. Data analysis appears to be an important area where further improvement and unification of procedures are necessary. Experience from published studies indicates which steps are important and should be considered when designing data analysis workflows.

Most importantly, the complexity of the data and the specifics of the methods may cause several biases that affect the quality of the resulting sequence dataset and any subsequent statistical analyses or ecological considerations. These include pyrosequencing-specific errors, sometimes termed “sequencing noise” (Quince et al. 2009, 2011), unclear quality of low abundance sequences, the presence of chimeric sequences (Taylor and Houston 2011; Tedersoo et al. 2010) and PCR target-associated biases (Bellemain et al. 2010; Krüger et al. 2012).

Despite the development of alternative sequencing platforms (Shokralla et al. 2012), pyrosequencing will likely remain widely used in the near future. For this reason, we believe that the standardisation of methods for fungal community analysis is highly desirable because it will soon allow us to exploit the wealth of individual studies to deliver general statements regarding fungal diversity, biogeography or ecology.

Standards of data reporting that include information regarding the sampling site and its corresponding metadata, laboratory processing steps and data analysis were previously suggested (Nilsson et al. 2011). The aim of this paper was to describe the data analysis procedures previously used, indicate the limiting steps and suggest a simple data analysis workflow that can avoid potential problems. Because the processing of large-scale pyrosequencing-derived data may represent a methodological limitation, a newly developed software pipeline, SEED, is introduced in this paper that allows researchers to perform the required data analysis steps with a single, easy-to-use user interface.

Materials and methods

Meta-analysis of studies using amplicon pyrosequencing to explore fungal communities

Scientific publications using amplicon pyrosequencing to analyse fungal communities were retrieved. The source of sample material, molecular target (gene and primer pair) and number of sequences that were used for community analysis were recorded. With respect to the experimental methodology used for sequence processing, the minimum sequence length and the presence or absence of sequence processing steps (removal of pyrosequencing noise, removal of chimeric sequences, creation of similarity clusters, diversity analysis and sequence annotation) were recorded. The bioinformatic tools used for data cleanup, sequence clustering and annotation were also recorded (Table 1). The data retrieved from publications were used to analyse the approaches used in fungal community amplicon pyrosequencing.

Table 1 Overview of studies using amplicon-based 454 pyrosequencing to analyse fungal communities along with important parameters of data processing

Development of pipeline to analyse sequences obtained by amplicon pyrosequencing

Based on the previously applied approaches to amplicon pyrosequencing data analysis, the necessary steps were identified and an analysis workflow was proposed. The development of the optimized workflow was based both on the available knowledge from previous papers about the effects of certain data analysis steps on the resulting dataset quality (Schloss et al. 2009; Edgar et al. 2011) and on our own analysis of a sample dataset. For this purpose, we used the publicly available dataset deposited in MG-RAST (accession 4497081.3), which contains sequences of the fungal internal transcribed spacer (ITS) region from oak leaves at different stages of decomposition (Voříšková and Baldrian 2013). The aim was to analyse the effects of certain data analysis steps on the fungal diversity estimates and on the identification of operational taxonomic units (OTUs; defined as sequences clustered at a 97 % similarity level). Specifically, we analysed (1) for each sample (n = 21, 1129 sequences per sample) the effects of clustering sequences of original length (380–560 bases) versus sequences truncated to the same length on OTU richness, the Chao estimate and the number of singletons; (2) for each sample the effects of chimera removal on OTU richness, Chao and singletons; and (3) for the 150 most abundant OTUs, the quality of OTU identification when OTUs were represented either by random sequences or by consensus sequences. The quality of identification was defined as the similarity of the query sequence and the most similar Sanger sequencing-derived sequence deposited in GenBank (http://www.ncbi.nlm.nih.gov/genbank/). This is based on the assumption that sequences containing errors are less similar to real sequences and that consensus construction should correct random errors in sequences. Sequence clustering and chimera removal were performed using default Usearch and Uchime settings (Edgar 2010; Edgar et al. 2011), and nucleotide BLAST (Altschul et al. 1990) was used to retrieve the closest hits from GenBank. The Wilcoxon paired test was used to analyse the differences among dataset pairs. Differences at P < 0.01 were regarded as statistically significant. The optimized data analysis workflow was used in the course of the development of a user-friendly data analysis pipeline.
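
The three per-sample quantities compared here (OTU richness, the number of singletons and the Chao estimate) can be illustrated with a short Python sketch. The text does not state which form of the Chao estimator was used; the bias-corrected Chao1 formula below is an assumption, chosen because it stays defined when a sample contains no doubletons. The function names are hypothetical and are not part of SEED.

```python
from collections import Counter


def chao1(otu_counts):
    """Bias-corrected Chao1 richness estimate from per-OTU sequence counts.

    Assumed form: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1)), where F1 and F2
    are the numbers of singleton and doubleton OTUs.
    """
    counts = [c for c in otu_counts if c > 0]
    s_obs = len(counts)                      # observed OTU richness
    f1 = sum(1 for c in counts if c == 1)    # singleton OTUs
    f2 = sum(1 for c in counts if c == 2)    # doubleton OTUs
    return s_obs + f1 * (f1 - 1) / (2 * (f2 + 1))


def summarise_sample(otu_labels):
    """Summarise one sample given the OTU label assigned to each sequence."""
    counts = list(Counter(otu_labels).values())
    return {
        "otus": len(counts),
        "singletons": sum(1 for c in counts if c == 1),
        "chao1": chao1(counts),
    }


if __name__ == "__main__":
    # Toy sample: 10 sequences assigned to 5 OTUs, three of them singletons.
    labels = ["a"] * 5 + ["b"] * 2 + ["c", "d", "e"]
    print(summarise_sample(labels))
```

Computing such a summary for every sample under two processing variants (for example, original-length versus truncated sequences) yields the paired values that a paired test then compares.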

The SEED pipeline (http://www.biomed.cas.cz/mbu/lbwrf/seed/main.html) was created to enable users to perform the entire bioinformatic analysis of PCR amplicons according to the suggested workflow. The same pipeline was also used to perform all workflow testing steps outlined above. The functionality of the pipeline was tested with datasets from previous pyrosequencing projects with amplicon sequences for the fungal ITS region, bacterial 16S rDNA and the fungal cbhI exocellulase gene (Baldrian et al. 2012; Štursová et al. 2012; Větrovský and Baldrian 2013; Voříšková and Baldrian 2013). In the last paper, the data analysis workflow recommended here was used for data processing.

The SEED pipeline is a workbench that runs in the Microsoft Windows environment and combines internal functions with functions performed by external programmes, which must be installed for full functionality. The removal of pyrosequencing noise is performed using Pat Schloss’s translation of Chris Quince’s PyroNoise algorithm implemented within the Mothur package (Schloss et al. 2009). The removal of chimeras created during PCR amplification is performed using Uchime (Edgar et al. 2011), and Usearch (Edgar 2010) is used for sequence clustering. Sequence alignment is performed by calling MAFFT (Katoh et al. 2009), and BLAST searching and the creation of local databases depend on the National Center for Biotechnology Information (NCBI) tools (http://www.ncbi.nlm.nih.gov/; Altschul et al. 1990). An Internet connection is required for searching online databases, e.g. the NCBI nucleotide database.

The SEED pipeline is freely available for non-commercial use and can be downloaded along with documentation from the SEED project webpage: http://www.biomed.cas.cz/mbu/lbwrf/seed/main.html. The installation of external programmes may require the consent of their authors: more information can be found at the web pages of these projects, accessible by hyperlink from the above address.

Results

In total, 42 published studies were analysed (Table 1). The number of papers using amplicon pyrosequencing to analyse fungal communities increased rapidly from 3 in 2009 to 22 in 2012. Soil fungal communities were the most common target of amplicon pyrosequencing (21 studies), followed by fungal communities in plant roots (12 papers). Other environments (sediments, aboveground plant tissues, corals or wood) were only rarely addressed. Although most studies were designed to cover the entire fungal community, six papers specifically targeted arbuscular mycorrhizal fungi. In addition to analysing entire fungal communities, amplicon sequencing was also applied to analyse the diversity of the fungal cbhI exocellulase gene, a proxy for the community of cellulose-decomposing fungi (Baldrian et al. 2012; Štursová et al. 2012; Voříšková and Baldrian 2013). To date, RNA-derived amplicons have been used to specifically analyse metabolically active fungal taxa in only a single study (Baldrian et al. 2012; Purahong and Krüger 2012).

The ITS region was by far the most frequently analysed region of fungal rDNA: only four and three papers analysed various regions of the 18S and 28S rRNA genes, respectively (Fig. 1). Within the ITS, ITS1 was mainly targeted with several primer pairs to amplify only this region; in additional studies, both the ITS1 and ITS2 regions were amplified, but because of the limiting lengths of pyrosequencing-derived sequences and the fact that sequencing mostly occurred from primers within the 18S, the sequence data also covered predominantly the ITS1 region. Only recently, studies analysing the ITS2 region specifically have been conducted (Davey et al. 2012; Hartmann et al. 2012; Ihrmark et al. 2012; Menkis et al. 2012).

Fig. 1 Primers and PCR amplicons used in amplicon pyrosequencing analyses of the community composition of general fungi. The thickness of grey bars indicates the number of studies using the respective amplicons. Numbers indicate the positions in the rDNA of Fusarium oxysporum

The initial steps of sequence data processing typically consisted of sequence quality filtering and reduction of PCR or sequencing errors. A wide set of tools was used for data cleanup, which resulted in the removal of sequences of insufficient length or quality, but the minimal length of sequences retained in the cleaned dataset varied considerably (Table 1 and Fig. 2). Typically, between 10 and 40 % of sequences were removed in this step. Pyrosequencing-derived errors, typically the variable lengths of longer homopolymer regions, were corrected by clustering pyrosequencing flowgrams, termed “denoising”, and PCR-derived errors were removed by chimera-cleaning tools. Despite the high rate of occurrence of both types of errors, fewer than 30 % of all studies used one of these approaches and only 12 % used both (Fig. 2).

Fig. 2 Overview of approaches used to analyse sequences derived by amplicon pyrosequencing of fungal rDNA based on 42 recently published studies

Sequences that passed the filtering steps were used to create virtual taxa, i.e. the sequence similarity clusters most often termed operational taxonomic units. Despite the inconsistency of clustering sequences of variable lengths, only a handful of studies truncated sequences to identical lengths or extracted particular DNA regions before clustering. CD-HIT, BLASTCLUST and CAP3 were most frequently used for clustering. For annotation, OTUs were represented either by a randomly selected sequence or by the longest or most abundant sequence. In six studies, consensus sequences were constructed to represent OTUs (Table 1). Approximately one half of the studies only considered non-singleton sequences for community analysis (Fig. 2), and BLAST against the NCBI database was the most frequent approach to assign taxonomic identity to OTUs. In 55 % of studies, diversity parameters were calculated for individual samples. Among these, only 26 % performed resampling to the same depth before calculating diversity (Fig. 2).

After considering the previous data analysis protocols, we suggest the following workflow (Table 2). Quality trimming should first exclude sequences of low base quality or insufficient length. The minimal length of sequences to be analysed should be more than 150 bases because both the ITS1 and ITS2 are longer than that in many fungi. The quality of taxonomic assignments based on the 18S or 28S region analyses also greatly increases with sequence length. Both denoising and chimera removal should be performed to reduce the sequence error rate to a minimum. Because clustering algorithms compare sequences in a pairwise manner, the sequences to be clustered should optimally cover the same DNA region, i.e. be delimited by defined primer positions at both ends (if the amplicons are shorter than the pyrosequencing read length) or correspond to a defined region (e.g. ITS1, ITS2, ITS1 + 5.8S + ITS2) that can be extracted easily (Nilsson et al. 2010), or should at least all have the same length. Consensus sequences best represent individual sequences within an OTU.
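
The simplest of the options above, bringing all sequences to the same length before clustering, can be sketched in a few lines of Python. This is a minimal illustration, not the SEED implementation: it assumes that the reads have already been quality-filtered, that primers and tags have been removed so that all reads start at the same position, and it uses a hypothetical function name. The default of 380 bases is the truncation length used in the test dataset of this paper.

```python
def truncate_reads(reads, length=380):
    """Drop reads shorter than `length` and cut the rest to exactly `length`.

    `reads` maps a read name to its nucleotide sequence. Reads are assumed
    to start at the same position (primer already removed), so truncation
    leaves every retained read covering the same stretch of the marker.
    """
    return {
        name: seq[:length]
        for name, seq in reads.items()
        if len(seq) >= length
    }


if __name__ == "__main__":
    example = {"read1": "A" * 400, "read2": "C" * 379, "read3": "G" * 380}
    kept = truncate_reads(example)
    print(sorted(kept), {len(s) for s in kept.values()})
```

The same function with a smaller `length` only enforces equal length; it does not guarantee that the retained region is a complete ITS1 or ITS2, for which region extraction is the better option.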

Table 2 Workflow of the analysis of sequences derived by amplicon pyrosequencing of fungal communities

Depending on the aim of the study, sequence identification may be requested either based on the identity of the closest database hit or through multiple alignment of OTU sequences with known sequences. In the studies targeting the diversity of fungal communities, community richness, evenness or other parameters may also be derived. To obtain comparable data, the sequence database has to be randomly resampled to obtain identical numbers of sequences from each sample.
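
Random resampling to an identical depth can be sketched as follows. This is an illustrative Python example under stated assumptions rather than the procedure implemented in SEED: sequences are drawn without replacement, samples with fewer sequences than the chosen depth are dropped, and the function name and the fixed seed are hypothetical choices made for reproducibility.

```python
import random


def resample_to_depth(samples, depth, seed=0):
    """Randomly draw `depth` sequences, without replacement, from each sample.

    `samples` maps a sample name to a list of its sequences (or sequence
    identifiers). Samples holding fewer than `depth` sequences are omitted
    because they cannot be resampled to the common depth.
    """
    rng = random.Random(seed)  # fixed seed so that the draw is repeatable
    resampled = {}
    for name, reads in samples.items():
        reads = list(reads)
        if len(reads) < depth:
            continue
        resampled[name] = rng.sample(reads, depth)
    return resampled


if __name__ == "__main__":
    toy = {
        "s1": [f"s1_read{i}" for i in range(2000)],
        "s2": [f"s2_read{i}" for i in range(1129)],
        "s3": [f"s3_read{i}" for i in range(500)],
    }
    out = resample_to_depth(toy, depth=1129)
    print({name: len(reads) for name, reads in out.items()})
```

Diversity parameters calculated on the resampled lists are then based on the same number of sequences in every sample.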

Clustering of sequences truncated at 380 bases gave lower OTU counts, numbers of singletons and Chao estimates of total community richness than the clustering of sequences of their original lengths of 380–560 bases. The numbers of OTUs, singletons and Chao estimates were lower by 13.4 ± 1.6, 12.7 ± 2.4 and 7.4 ± 3.3 %, respectively, all differences being statistically significant at P < 0.003. This shows that the OTU counts are inflated when sequences of different lengths are clustered together. The application of chimera removal to sequences truncated to 380 bases decreased the numbers of OTUs, singletons and Chao estimates further by 20.1 ± 1.2, 16.7 ± 2.0 and 17.5 ± 3.4 %, respectively, all differences being statistically significant at P < 0.001. This shows that a significant part of the apparent diversity in the dataset may be due to the presence of chimeric sequences. Consensus sequences of the 150 most abundant OTUs in the dataset showed significantly higher (P < 0.0001) sequence similarity to the closest BLAST hit in GenBank than random sequences, with 69 % of consensus sequences showing higher similarity, 27 % showing the same similarity and 4 % showing lower similarity. Moreover, 13 % of OTU consensus sequences showed 100 % similarity to the GenBank sequence, whilst the corresponding random sequences were less similar. On average, consensus sequences were 0.29 ± 0.04 % more similar to the closest GenBank hits than randomly selected sequences.
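
Why a consensus sequence tends to correct random errors can be shown with a simple majority-vote sketch in Python. The way SEED builds consensus sequences is not detailed in this text, so the column-wise majority rule below is an assumption, as are the function name and the decision to drop columns in which a gap is the most frequent character. The input is assumed to be the member sequences of one OTU after multiple alignment (e.g. with MAFFT), all of equal length with "-" for gaps.

```python
from collections import Counter


def majority_consensus(aligned):
    """Column-wise majority consensus of equal-length aligned sequences.

    Columns whose most frequent character is a gap ("-") are omitted.
    Ties are resolved in favour of the character seen first in the column.
    """
    if not aligned:
        raise ValueError("at least one aligned sequence is required")
    width = len(aligned[0])
    if any(len(seq) != width for seq in aligned):
        raise ValueError("aligned sequences must have equal length")

    consensus = []
    for column in zip(*aligned):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)


if __name__ == "__main__":
    # One read carries a substitution (G -> C), another an inserted T.
    otu_members = ["ACGT-A", "ACGTTA", "ACCT-A"]
    print(majority_consensus(otu_members))
```

In this toy OTU, each error occurs in only one of three reads and is therefore outvoted, whereas a randomly chosen representative would carry an error in two out of three cases.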

The SEED pipeline makes it possible to perform all steps of the sequence analysis workflow from a single, user-friendly interface (Fig. 3). The features of the pipeline are summarised in Table 3, and more information can be found on the project webpage (http://www.biomed.cas.cz/mbu/lbwrf/seed/main.html) that contains full documentation of the functions and a step-by-step introduction to the data processing workflow. Importantly, in addition to sequence grouping, SEED makes it possible to perform batch operations with groups, such as chimera removal from individual samples, calculation of consensus sequences for individual OTUs, resampling of all samples at a specific depth, etc. SEED can be used to analyse PCR amplicons of any type, e.g. bacterial 16S rDNA or functional genes, or to analyse gene sequences obtained by other means (e.g. batch download from the NCBI nucleotide or genome database).

Fig. 3 Screenshot of the amplicon pyrosequencing pipeline SEED

Table 3 Features of the amplicon pyrosequencing pipeline SEED

Discussion

The methods of next-generation sequencing have revolutionised microbial ecology, allowing researchers to explore complex communities at unprecedented depths. Despite the first applications of the Illumina (Caporaso et al. 2012) or Ion Torrent (Whiteley et al. 2012) technologies to explore bacterial communities, pyrosequencing remains the method of choice for fungal and bacterial amplicon sequencing, offering the advantages of reasonable sequence length, easy multiplexing and sufficient sequencing depth for most studies (Glenn 2011). Nevertheless, successful applications of pyrosequencing approaches are dependent on a number of methodological considerations, including sampling strategies and metadata collection, the choice of suitable molecular marker and approaches for data analysis. Because of the diversity of all of the above methodologies in the published studies, it is extremely difficult to use the wealth of information derived by pyrosequencing for inter-study comparisons or meta-studies. Furthermore, published papers differ widely in the level of method descriptions and data availability. We strongly agree with the previous paper by Nilsson et al. (2011) in that full description of the experimental procedures and public data availability should be a standard.

Here we show that despite some general preferences, many different molecular targets are used to study both general fungi and arbuscular mycorrhizal fungi. Without exception, fungal rDNA was targeted despite widely varying relationships between its copy numbers and fungal cell counts or biomasses (Amend et al. 2010a; Baldrian et al. 2013). The ITS region amplified using various sets of primers was the preferred target, consistent with the dominant current opinion (Schoch et al. 2012).

Although ITS1 was frequently sequenced, it is notable that the results obtained with various primers cannot be easily compared because of their variable coverage of the fungal tree of life (Anderson et al. 2003). Unfortunately, there are only a few papers in which various primers were compared. The recent paper by Ihrmark et al. (2012) demonstrates that both PCR amplification and the resulting diversity estimates can differ considerably among primer pairs. More work is still required in this direction.

The data analysis procedures used in past amplicon pyrosequencing studies indicate many potential limitations of data quality. Studies using sequences shorter than 150 bases covered less than the entire ITS1 or ITS2 regions of certain fungi because of the differences in the lengths of these regions, which seems unsuitable. In our in silico study considering the region between the ITS1/ITS4 primers, fungal sequence assignment quality increased with increasing sequence length up to a length of 350–380 bases (data not shown). Such sequence lengths are easily available with current technologies and may be desirable when reliable OTU classification is required. Furthermore, clustering algorithms work best with sequences of identical boundaries (or lengths), a fact that is usually not considered. Here, we show that clustering of sequences of uneven length significantly increases the diversity estimates.

PCR and pyrosequencing have been shown to cause method-dependent sequencing errors (Quince et al. 2009; Tedersoo et al. 2010). In PCR amplification, chimeric sequences are formed with frequencies at or above 3 %, depending on the number of cycles (Taylor and Houston 2011). Because these sequences are most often singletons, the presence of chimeric sequences may result in an overestimation of diversity. This was also clearly demonstrated here in the comparison of diversity estimates among the original and chimera-cleaned dataset. Chimera-cleaning procedures should therefore always be applied. When choosing a minimal length, one should also consider that the probability of detecting chimeric sequences rapidly increases with sequence length, and shorter sequences are more likely to contain undiscovered chimeras. In addition, the increase of sequence error counts associated with increasing sequence lengths and the frequency of sequencing errors in homopolymeric regions that stem from the techniques of pyrosequencing should be reduced by applying denoising (i.e. error correction) procedures (Quince et al. 2011). Unfortunately, error-correcting procedures have been rarely applied so far. Given the error rate of pyrosequencing-derived reads and the random distribution of such errors, the creation of OTU consensus sequences should further improve the representation of an OTU. This was demonstrated here by the fact that the consensus sequences are significantly more similar to the Sanger sequences deposited in GenBank than individual OTU sequences.

To explore fungal diversity, the analysis of identical numbers of sequences from all samples is essential because diversity estimates always scale up with sampling depth. This fact has also been frequently neglected in past studies.

We here outline a workflow of data analysis that aims to reflect all of the considerations required for obtaining high-quality data for community analysis and offer the SEED pipeline to accomplish this task. We hope that the unification of data analysis procedures represents an important step towards better comparability of individual studies and justification of their conclusions. The SEED pipeline should offer ecologists a tool that is easy to use, even for those with no preliminary experience with amplicon pyrosequencing, the method that will likely continue to dominate microbial community analysis in the coming years.