Introduction

A fast, inexpensive and accurate method to measure microbial diversity would be a welcome addition to the toolbox of microbial ecology. Fingerprinting techniques are still considered to be good candidates, but their lack of quantifiability is generally viewed as a major obstacle. A recent paper in this journal by Lalande et al. [1] sheds new light on this problem. Building on previous simulation studies [2, 3], they propose a quantitative framework to analyze the estimation properties of diversity metrics from fingerprints.

The framework of Lalande et al. [1] consists of three steps. First, the fingerprinting profile is analyzed: peaks are detected, the area under the peaks is determined and the background, that is, the area under the profile not attributed to any of the peaks, is computed. Peak areas are assumed to represent the abundance of the dominant phylotypes of the community. The background area is assumed to be equal to the total abundance of the other, rare phylotypes.

Second, the community abundance distribution is reconstructed starting from the abundance of the dominant phylotypes. This extrapolation step requires an assumption about the abundance distribution of the rare phylotypes. Different assumptions lead to different reconstructed communities, which may be very dissimilar to the true (and unknown) community. Lalande et al. [1] consider a set of reconstructed communities to quantify the effect of the assumed abundance distribution.
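To make the extrapolation step concrete, the following Python sketch builds reconstructed communities from the same fingerprint data. It is an illustration only, not the procedure of Ref. [1]: the function name `reconstruct_community`, the equal-split model for the rare phylotypes and the example numbers are all assumptions. The sketch assumes that the rare phylotypes share the background abundance and that none of them is more abundant than the smallest detected peak.

```python
def reconstruct_community(peak_areas, background, n_rare):
    """Return relative abundances of one reconstructed community.

    Hypothetical tail model: the background area is split equally
    among n_rare rare phylotypes. Other tail shapes are equally
    consistent with the fingerprint.
    """
    if n_rare < 1:
        raise ValueError("need at least one rare phylotype")
    rare_area = background / n_rare
    # Assumption: a rare phylotype cannot exceed the smallest peak,
    # otherwise it would have been detected as a peak itself.
    if rare_area > min(peak_areas):
        raise ValueError("rare phylotypes would exceed the smallest peak")
    total = sum(peak_areas) + background
    dominant = [area / total for area in peak_areas]
    rare = [rare_area / total] * n_rare
    return dominant + rare


# Illustrative (made-up) fingerprint: four peaks plus background.
peaks = [30.0, 20.0, 10.0, 5.0]
background_area = 35.0

# Two reconstructions that share the same peaks and background
# but differ strongly in the number of rare phylotypes.
small_community = reconstruct_community(peaks, background_area, 7)
large_community = reconstruct_community(peaks, background_area, 1000)
```

Both communities reproduce the same peak areas and the same background, so a fingerprint could not distinguish between them, even though one contains 11 phylotypes and the other 1004.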

Third, phylotype richness, Shannon diversity and Simpson diversity are computed for each of the reconstructed communities. Each of the computed diversity values is an estimate of the true community diversity. If for a diversity metric the range of estimates is wide, then the estimation depends strongly on the assumed abundance distribution, and the diversity metric cannot be estimated accurately. A narrow range of estimates indicates that the diversity metric can be estimated accurately. In that case, the range of estimates can be interpreted as a measure of the accuracy with which the diversity metric can be estimated.
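As an illustration of this third step, the sketch below computes the three metrics for a list of abundances. It assumes that the metrics are expressed as effective numbers of phylotypes, that is, richness, the exponential of the Shannon entropy and the inverse of the Simpson concentration. The function name `diversity_metrics` is hypothetical.

```python
import math


def diversity_metrics(abundances):
    """Return (richness, Shannon, Simpson) as effective numbers."""
    total = sum(abundances)
    # Relative abundances; phylotypes with zero abundance are ignored.
    p = [a / total for a in abundances if a > 0]
    richness = len(p)
    shannon_entropy = -sum(x * math.log(x) for x in p)
    simpson_concentration = sum(x * x for x in p)
    return richness, math.exp(shannon_entropy), 1.0 / simpson_concentration


# A perfectly even community of four phylotypes: all three metrics equal 4.
even = diversity_metrics([1, 1, 1, 1])

# An uneven community: Shannon and Simpson diversity drop below richness.
uneven = diversity_metrics([0.5, 0.25, 0.25])
```

Applying such a function to each reconstructed community yields one estimate per metric and per community; the spread of these estimates is the estimation range discussed above.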

Lalande et al. [1] applied this framework to nine in silico generated fingerprints. They obtained narrow estimation ranges of the order of ±10 % both for Shannon and Simpson diversities and somewhat wider ranges for phylotype richness. These findings lead to the conclusion that accurate diversity estimation from fingerprints is possible, especially for Shannon and Simpson diversities.

In our opinion, the framework proposed by Lalande et al. [1] is a valuable contribution for evaluating the accuracy of diversity estimation from fingerprints. We note that a similar framework was introduced recently to assess diversity estimation from metagenomic data sets [4]. However, we argue that the framework can yield stronger conclusions than those presented in Ref. [1]. In particular, by considering a larger set of reconstructed communities, we show that the estimation ranges for phylotype richness and for Shannon diversity become very wide. Only for Simpson diversity do we find a narrow estimation range. Hence, we are left to conclude that only Simpson diversity can be estimated accurately from fingerprints. This stands in sharp contrast to the conclusions of Ref. [1].

To make our argument, we consider in Fig. 1 the data set used in Fig. 1 of Ref. [1]. The left-hand panel uses the same axis scaling as Ref. [1] (linear on x-axis, logarithmic on y-axis). The right-hand panel is identical to the left-hand panel except that we use double logarithmic scaling, which is more convenient for our purpose. The fingerprint peak areas are represented as × marks. Recall that these areas are assumed to be equal to the abundance of the dominant phylotypes in the community. The black line represents the rank-abundance curve of the data set from which the fingerprint was generated, but which is unavailable for the diversity estimation from the fingerprinting profile.

Fig. 1

A variety of reconstructed communities is consistent with a fingerprinting profile. Rank-abundance curves are shown for the dominant phylotypes obtained from the fingerprint (black × marks, to the left of the dashed line), for four reconstructed communities of the rare phylotypes (red, yellow, green and blue lines to the right of the dashed line) and for the “true” community from which the fingerprint was generated (black line). The two panels are identical, except that in the left-hand panel, the scale of the x-axis is linear, whereas in the right-hand panel, the scale of the x-axis is logarithmic.

Four community reconstructions (see Appendix for details) are shown as colored lines (red, yellow, green and blue). Each of these reconstructed communities is consistent with the fingerprinting data. That is, if one generated a fingerprint of each reconstructed community, one would obtain fingerprints that are very similar to the fingerprint we are analyzing. In other words, fingerprinting cannot tell the difference between these four communities. Nevertheless, the structure of these four communities is very different, as shown by their rank-abundance curves.

The yellow community is a realistic reconstruction because it is close to the data set from which the fingerprint was generated (compare yellow and black lines in Fig. 1). The other three communities have a qualitatively similar structure, but very different phylotype richness. Are these numbers, such as 10^6 phylotypes for the blue community, realistic? In fact, there is no reason to consider them to be unrealistic. Even large metagenomic data sets are not sufficiently informative to rule out such large numbers of phylotypes, as argued in Ref. [4]. Therefore, when analyzing diversity estimation from fingerprints, we should also take into account more extreme reconstructions such as the blue community.

The difference between the reconstructed communities has important consequences for the diversity estimation problem (see Fig. 2). For each reconstructed community, we plot phylotype richness, Shannon diversity and Simpson diversity. Recall that these values are interpreted as possible diversity estimates. The ranges of estimates for phylotype richness (from 10^3 to 10^6) and for Shannon diversity (from 700 to 2 × 10^4) are very wide. This implies that phylotype richness and Shannon diversity cannot be estimated accurately. The range of estimates for Simpson diversity is narrow (from 410 to 530, or 470 ± 13 %), implying that Simpson diversity can be estimated accurately.
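The conversion of the Simpson range into a relative accuracy can be checked with a few lines of Python. The helper name `relative_half_range` is an assumption made for illustration; it expresses a range as its midpoint plus or minus half its width, with the half-width given as a percentage of the midpoint.

```python
def relative_half_range(low, high):
    """Return (midpoint, half-width as a percentage of the midpoint)."""
    midpoint = (low + high) / 2
    half_width = (high - low) / 2
    return midpoint, 100 * half_width / midpoint


# Simpson diversity range quoted in the text: from 410 to 530.
midpoint, percent = relative_half_range(410, 530)
# midpoint is 470 and percent is 60 / 470 * 100, about 12.8,
# which rounds to the 13 % quoted above.
```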

Fig. 2

Estimation ranges differ greatly between diversity metrics. For the four reconstructed communities of Fig. 1 (red, yellow, green and blue), we plot phylotype richness (first column), Shannon diversity (second column) and Simpson diversity (third column). The three diversity metrics are expressed as effective numbers of phylotypes. The grey-shaded regions indicate the range of diversity estimates consistent with the fingerprint (see Appendix for details).

The above analysis is based on only four reconstructed communities. How sensitive are the results to the choice of these communities? To answer this question, we propose a general analysis in which we take into account all communities consistent with the fingerprint (see Appendix for details). We determine lower and upper bounds for the estimation ranges of the three diversity metrics (shown as grey-shaded regions in Fig. 2). As before, we find a narrow estimation range for Simpson diversity only, confirming that Simpson diversity, but neither phylotype richness nor Shannon diversity, can be estimated accurately from fingerprints. Interestingly, a similar analysis for metagenomic data sets indicated that both Shannon and Simpson diversities, but not phylotype richness, can be estimated accurately [4].
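One way to see why Simpson diversity is so well constrained is to bound the contribution of the rare phylotypes directly. The sketch below is not the analysis of the Appendix; it is a simplified argument under the assumption that every rare phylotype is at most as abundant as the smallest detected peak. Under that assumption, the rare phylotypes add between 0 and (background abundance × smallest peak abundance) to the sum of squared relative abundances, so the inverse Simpson diversity is confined to a bounded interval however many rare phylotypes there are. The function name `simpson_bounds` and the example numbers are hypothetical.

```python
def simpson_bounds(peak_areas, background):
    """Bounds on inverse Simpson diversity over all assumed-consistent communities."""
    total = sum(peak_areas) + background
    p = [area / total for area in peak_areas]  # dominant phylotypes
    b = background / total                     # total rare abundance
    dominant_sum = sum(x * x for x in p)
    # For rare abundances q_j with sum(q_j) = b and q_j <= min(p):
    #   0 <= sum(q_j ** 2) <= min(p) * b
    tail_max = min(p) * b
    lower = 1.0 / (dominant_sum + tail_max)  # rare phylotypes as large as allowed
    upper = 1.0 / dominant_sum               # limit of infinitely many tiny phylotypes
    return lower, upper


# Illustrative (made-up) fingerprint: four peaks plus background.
lower, upper = simpson_bounds([30.0, 20.0, 10.0, 5.0], 35.0)
```

The interval stays finite because squaring the abundances suppresses the rare phylotypes. By contrast, the same assumption alone places no upper limit on phylotype richness, since the background can be split among arbitrarily many phylotypes.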

To summarize, we have shown that the theoretical framework of Lalande et al. [1] can be extended to reach the following conclusions: (1) phylotype richness and Shannon diversity cannot be estimated accurately from fingerprinting profiles and (2) Simpson diversity can be estimated with an accuracy of the order of 10 %. These conclusions should be relevant for various fingerprinting techniques, such as denaturing gradient gel electrophoresis (DGGE), single-strand conformation polymorphism (SSCP), ribosomal intergenic spacer analysis (RISA) and terminal restriction fragment length polymorphism (T-RFLP).