Introduction

Extrapolating global estimates of tropical arthropod species richness from samples, as first proposed by Erwin (1982) and revisited by many since (e.g., Thomas 1990; Stork 1988, 1993; Ødegaard 2000; Novotný et al. 2002), is an intriguing exercise because it potentially offers a significant short-cut that would save having to count species one by one, but at the same time it is vulnerable to producing massively misleading estimates, owing ultimately to the need to base extrapolations on host specificity measurements made for a minute proportion of all tropical tree species. The models are typically based upon a sample of beetle species collected from one or several tree species. This is because beetles are the most common taxon, accounting for about 25 and 40 % of all described insects and species, respectively (Hammond 1992; Yeates et al. 2003). Then, by making assumptions about host specificity to trees, the number of tropical tree species in the world, the proportions of species in the canopy and ground, and the proportion of all arthropods that are beetles, one can estimate how many tropical arthropod species might exist.

The model described above is a model of mean behaviour, that is, the average state of one parameter (species richness) of a far more complicated system over time. An individual-based model, where individual species, and even individual insects, are represented as discrete units would plainly be an insurmountable undertaking. For example, in this mean-behaviour model, the host specificity is a single parameter, but in an individual-based model it could require consideration of such things as the number of individual trees per tree species and hence the size of the populations per beetle species, the evolutionary life time of individual tree species, the number of months during which each tree carries leaves to be eaten by phytophages, the number of closely related tree species that might serve as a pool of phytophagous species to colonise a focal tree species, and the niche breadth and intraspecific differentiation of the tree species.

Until recently, all such extrapolation mean-behaviour models were purely deterministic; that is, despite the considerable uncertainties associated with the various parameters, no attempts were made to account for these. To this end, we recently published a probabilistic model (Hamilton et al. 2010, 2011), which has been seen as a significant step forward because it was the first attempt to explicitly deal with uncertainties in the extrapolation process (May 2010). In line with previous models, the model took the following form:

$$ N_{\text{Ai}} = \left( {xc/p_{\text{cg}} p_{\text{ba}} } \right)n_{\text{t}} , $$
(1)

where N Ai is the estimator of the number of tropical arthropod species under the assumption of independence between variables, x is the average effective specialisation (May 1990) of herbivorous beetle species across all tree species, c is a correction factor for non-herbivorous beetle species, p ba is the proportion of canopy arthropod species that are beetles, p cg is the proportion of all arthropod species found in the canopy, and n t is the number of tropical tree species. Note the change in notation for p ba and p cg from the original model; this was done because in retrospect the original notation was potentially ambiguous and confusing (see Hamilton et al. 2010, 2011). Probability distributions were assigned to all parameters.

Implementation of our original model was achieved using Latin Hypercube Sampling (LHS), a specialised form of Monte Carlo simulation wherein probability distributions are sampled in a stratified random manner (McKay et al. 1979). As with any modelling technique, Monte Carlo simulation necessitates assumptions. Thus, while this was the first attempt to account for uncertainty, the model (1) made certain assumptions about distributional form used to represent uncertainty and (2) did not consider potential dependencies between variables.

Before considering the relevance of assumptions about distributional form, we need to appreciate the fundamental nature of uncertainty. While various taxonomies of uncertainty have been proposed (Kahneman and Tversky 1982; Morgan and Henrion 1990; Regan et al. 2002), there are in essence only two basic forms—variability and ignorance (Casti 1990; Benke et al. 2007). Variability represents natural randomness or stochasticity and cannot be reduced, and is often called aleatory uncertainty. Ignorance, on the other hand, is reducible and arises from numerous factors, including, inter alia, measurement error, lack of data and small sample sizes, and personal biases, and is also known as epistemic uncertainty. Theoretically, different methods are required to propagate ignorance and variability (Ferson and Ginzburg 1996). This can be attempted in the Monte Carlo framework using a technique known as Second-Order Monte Carlo, wherein variability is represented by a probability distribution and ignorance is characterised in an outer-loop in one of a number of ways, such as alternative model scenarios or distributional shapes (Vose 2000). However, as pointed out by Regan et al. (2004), Second-Order Monte Carlo still requires a subjective assessment of the realistic range of input distributions. In fact, the process is innately contradictory, because the greater the ignorance, the more data are required to specify bounds on the distribution. Of course, a greater amount of data should lead to narrower bounds for the variable’s distribution.

This conundrum often leaves the Monte Carlo analyst with little choice but to construct a one-dimensional model, wherein variability and ignorance are confounded in simple distributions, as we noted in our original paper. For example, very little information existed for variable c in our model, a correction factor for non-herbivorous beetle species. Ødegaard (2000) identified seven different studies relevant to the determination of c. There will of course be true natural variability associated with c, as it would be unreasonable to expect that the ratio of herbivorous to non-herbivorous beetle species would be constant across all tree species throughout the tropics and at all tropical locations. Likewise, ignorance emerges from the facts that the handful of studies used to estimate c use different methods, are all subject to various limitations associated with sampling arthropod faunas, and all have associated biases inherent with site selection (i.e., c would ideally be determined from studies of randomly selected tree species at randomly selected sites across the tropics if their sole purpose was to contribute to the estimation of this variable for this model).

With such limited information available, it was clearly not possible to propagate variability and ignorance separately for c or the other variables in the model (Hamilton et al. 2010). Rather, the approach taken was to consider them together using Uniform distributions. The rationale for using the Uniform distribution to represent highly uncertain environmental variables is that it is the most conservative approach (e.g., Brook et al. 2003; Mara et al. 2007). This makes intuitive sense because we have no reason to favour the selection of any value in the range over another. Upon closer inspection though, the Uniform actually makes some potentially significant assumptions about a variable. Consider c again, which covers the interval [1.79, 2.70]. The Cumulative Distribution Function (CDF) of a Uniform distribution for this interval is a perfectly linear monotonically increasing function where the mean = mode = median = 2.25. In theory, an infinite number of distributions could describe this interval, and these would be bounded within a box defined by two vertical lines extending from 0 to 100 % cumulative probability at the minimum and maximum values. But there can be only one true distribution representing variability for this interval, yet its form is unknown to us, and this ignorance needs to be expressed through allowing variation in shape. Using a single distribution (be it Uniform, Triangular or something else) ignores shape uncertainty (a sub-set of ignorance), and therefore leads to an overstatement of confidence.

The second problem associated with the application of Monte Carlo techniques to ecological models is that uncertainties about dependencies between variables cannot be expressed (Ferson 1996, 2002). Knowledge about dependencies is typically very poor in ecological models. Clearly, natural systems are complex and dependencies between variables are likely to exist, and these species richness estimation models are no exception. For example, the Janzen–Connell hypothesis (Janzen 1970; Connell 1971) proposes that predation on plants (in part by arthropods) is one of the mechanisms leading to the high plant richness found in the tropics. But there is also a reciprocal relationship because a greater diversity of plant resources provides opportunities for arthropods to specialise over time and thus diversify (Janz et al. 2006). Therefore, over evolutionary time, plant and arthropod communities interact through positive feedback and this increases the richness of both groups. Of course, the exact details and form of such dependencies are unclear, but the fact that they could exist means that they should not be ignored. It is possible that the dependencies themselves are not stable, and are likely to change over evolutionary time and even as a function of anthropogenic changes. Highly specialised species, for example, are not necessarily destined for an evolutionary dead-end, and can even give rise to generalists (Colles et al. 2009). Despite the complexity of such dependencies, the time is ripe to at least introduce the concept to species richness models so that advances in evolutionary biology can be used to modify these simple sample extrapolation models.

With respect to modelling dependencies, the default approach of independence, which is rarely stated explicitly, is intuitively appealing because there is usually not a clear dependency relationship between the various pairs of variables. However, as noted by Tucker and Ferson (2003), independence implies zero correlation but zero correlation does not demand independence. Furthermore, the possibility of higher-order dependencies should not be excluded. That is, not all dependencies will be pair-wise, between two variables: some could be multivariate. Consequently, Ferson (2002) suggests that models should start by making the assumption of dependence between all variables and at all levels, and independence should be assumed only when sound empirical information exists to support it. Vose (2000), whilst acknowledging it is a contentious issue, takes the opposing view, and suggests that one should avoid attempting to model correlation ‘where there is neither a logical reason nor evidence for its existence.’ In line with many Monte Carlo ecological models of systems wherein very little is known about the nature of inter-variable dependencies (Jonzen et al. 2002; Brook et al. 2003), our original species richness model invoked Vose’s philosophy. While dependencies can be specified in the Monte Carlo construct, uncertainty about their nature—magnitude and form—cannot be accommodated. In other words, ignorance about the dependencies cannot be included. Also, as noted by Ferson et al. (2004), the use of correlation coefficients to define dependencies—the typical approach used in Monte Carlo (Vose 2000)—is weak, as a dependency needs to be described by a complete dependency function (a copula), and several copulae can in fact have the same correlation. Finally, Monte Carlo methods do not readily allow for modelling of higher-order dependencies.

Probability bounds (p-bounds) analysis necessitates neither subjective assumptions about distributional form nor the nature of dependencies, and has proved useful in ecological models, where large uncertainties are often associated with these properties (Ferson 2002; Regan et al. 2002). Briefly, p-bounds analysis deals with classes of distributions rather than individual distributions (Frank et al. 1987; Williamson and Downs 1990). It not only offers a method for computing the bounds for a given variable but also enables the convolution (e.g., multiplication, division, addition, subtraction or exponentiation) of these distributional classes, and thus propagation of ignorance and variability together, through the model. While confidence intervals or credible intervals set bounds around a statistic for a variable, effectively as a function of its distribution, p-bounds are bounds surrounding the probability distribution itself. P-bounds must be expressed in terms of the CDF, not probability density or mass functions. In essence, p-bounds analysis can be seen as a highly conservative technique for determining the limits of an infinite array of possible CDFs, and it has been described by Burgman (2005) simply as a more honest approach because the analyst is not forced to make unjustified assumptions to satisfy a mathematical framework (cf. Monte Carlo). Philosophically, p-bounds analysis involves specifying total possible uncertainty and then explicitly removing it, whereas Monte Carlo approaches require uncertainty to be explicitly included. Ferson (2002) describes p-bounds analysis as a useful method for providing ‘quality assurance for Monte Carlo results’. Regan et al. (2002), for example, found for a food-web model that the p-bounds analysis was useful for checking the plausibility of a Monte Carlo model. It is also worth noting that they found the p-bounds envelope on the CDF to be markedly broader than one generated by a second-order Monte Carlo analysis.

Here we use probability bounds analysis to explore the implications of assumptions on the independence of variables and distributional forms used to account for uncertainties made by a previous model on the global species richness estimate for tropical arthropods.

Materials and methods

Hamilton et al. (2010, 2011) presented two models, A and B, which were respectively based on the estimated number of tree species in the tropics and the number of tropical tree genera in New Guinea alone. Here, p-bounds modelling is applied to Model A only (Eq. 1), as this is overwhelmingly the most common approach to the problem (Erwin 1982; Ødegaard 2000; Stork 1988; Thomas 1990). Furthermore, the two models are otherwise analogous. Model A is described in detail in Hamilton et al. (2010), with terminology specifically appropriate to the LHS methodology used.

For the LHS implementation of Model A the following Uniform distributions were used for four variables: c = 1.79–2.70, p cg = 0.25–0.66, p ba = 0.18–0.33, n t = 43,000–50,000. The variable x is the product of n k , the number of herbivorous canopy beetle species on tree species k (\( k = 1,2, \ldots ,l \)), and f k , the proportion of the beetle species effectively specialised on that species (see Hamilton et al. 2010 for calculation of f k ). A distribution for x was then obtained by producing 500,000 non-parametric bootstrap estimates of n k f k . The reader is also directed to the published corrigendum (Hamilton et al. 2011). It is also important to note that x represents an estimate of the average effective specialisation rather than the effective specialisation of a given tree species, n k f k . A distribution of x is what is required for this model. Drawing realisations from the distribution of n k f k would result in a distribution of imprecise estimates of tropical arthropod species richness (i.e., where each estimate is based on a single tree species), rather than a distribution of precise estimates with each being based upon the suite of species. This potential pitfall is common in uncertainty models, as described by Karavarsamis and Hamilton (2010) in a health risk context.

Calculations in p-bounds analysis are made on p-boxes. A p-box is defined as the class of CDFs (F(y)) bounded by a pair of CDFs, \( \underline{F} (y) \) and \( \overline{F} (y) \), such that \( \underline{F} (y) \le F(y) \le \overline{F} (y) \). In our models, two types of p-boxes were constructed for each variable. First, the entire cumulative probability space within the possible range for the variables was represented using ‘minimum–maximum’ (min–max) boxes. This is superficially, and perhaps intuitively, analogous to the use of a Uniform distribution in a Monte Carlo analysis, although it is in fact quite different because it permits all possible cumulative distributions between specified 0th and 100th cumulative percentiles. Second, ‘empirical histogram’ p-boxes were used to include all the available estimates of variables, not just the minima and maxima.

In line with the original paper, we used the review of Ødegaard (2000) to obtain estimates of c, p cg and p ba. It is worth noting that the various studies listed by Ødegaard are not all directly comparable, owing to different sampling techniques, and they do not always explicitly represent the variable of interest, but they characterise the best available information and hence have been used by many authors in extrapolating tropical arthropod species richness estimates (May 1990; Thomas 1990; Stork 1988, 1993; Novotný et al. 2002). Therefore, the following p-boxes were constructed, where MM and EH, respectively, denote variables for which min–max and empirical histogram boxes have been defined, and ‘minmax’ and ‘histogram’ is the respective coding terminology used by Ferson (2002): c MM = minmax (1.79–3.37), c EH = histogram (1.79, 2.70, 1.79, 2.13, 2.27, 2.44, 2.50, 2.70), p cgMM = minmax (0.25, 0.66), p cgEH = histogram (0.25, 0.66, 0.25, 0.33, 0.5, 0.66), p baMM = minmax (0.18, 0.33), p baEH = histogram (0.18, 0.33, 0.18, 0.22, 0.23, 0.23, 0.33), and n tMM = minmax (43,000, 50,000). Note that there were only two estimates available for n t (see Hamilton et al. 2010), hence only a min–max box is required. All the empirical histogram p-boxes are shown in Fig. 1. Min–max p-boxes are not shown because they are simply vertical lines extending from zero to 1 cumulative probability at the minima and maxima. x is the only variable for which p-bounds were not constructed (Fig. 1). Bootstrapping is a sampling procedure, and therefore the resultant distribution will converge to the Normal with increasing sample size, owing to the Central Limit Theorem. P-bounds are appropriate when uncertainty about distributional form exists, but that is not the case here. As noted in Hamilton et al. (2010), the dataset we used for determining the number of herbivorous beetle species effectively specialised on a given tree species is substantially larger than any other available dataset for this parameter. In any case, it is necessary to use a method that is congruent with the original model, so that the implications of the assumptions stated above can be assessed. Given that 500,000 bootstrap replicates were taken, the CDF of x is Normal. This CDF jointly expresses variability and ignorance. Empirical p-bounds constructed from multiple bootstrap replicates simply converge to the CDF as the number of replicates increases, and are meaningless in the context of representing natural variation and ignorance, reflecting nothing other than the effect of computational sample size.

Fig. 1
figure 1

a Cumulative distribution function of the average effective specialisation across all tree species (x). bd Empirical probability bounds (solid lines) for the cumulative distribution functions of the correction factor for non-herbivorous arthropods (c EH), the proportion of all arthropods found in the canopy (p cgEH), and the proportion of beetles that are arthropods (p baEH), respectively. The dashed lines in (bd) represent the cumulative distribution function for uniform distributions defined as follows: c = Uniform (1.79, 2.70), p cg = Uniform (0.25, 0.66), p ba = Uniform (0.18, 0.33)

The original LHS estimate was recalculated (N Ai, Eq. 1). Additionally the following estimators of tropical arthropod species richness were solved for:

$$ N_{\text{Ad}} = \left( {{{xc} \mathord{\left/ {\vphantom {{xc} {p_{\text{cg}} p_{\text{ba}} }}} \right. \kern-\nulldelimiterspace} {p_{\text{cg}} p_{\text{ba}} }}} \right)n_{\text{t}} ; $$
(2)

where the subscript d denotes that no assumptions are made about the nature of dependencies between variables and all variables are represented by the distributions described for Eq. 1,

$$ N_{\text{AiMM}} = \left( {x\left| \cdot \right|c_{\text{MM}} \left| / \right|p_{\text{cgMM}} \left| \cdot \right|p_{\text{baMM}} } \right)\left| \cdot \right|n_{\text{tMM}} , $$
(3)

where all variables are assumed to be statistically independent (i), as marked by the pipes (paired vertical lines) either side of each operator, and most variables are represented by min–max p-boxes,

$$ N_{\text{AdMM}} = \left( {{{xc_{\text{MM}} } \mathord{\left/ {\vphantom {{xc_{\text{MM}} } {p_{\text{cgMM}} p_{\text{baMM}} }}} \right. \kern-\nulldelimiterspace} {p_{\text{cgMM}} p_{\text{baMM}} }}} \right)n_{\text{tMM}} ; $$
(4)
$$ N_{\text{AiEH}} = \left( {x\left| \cdot \right|c_{\text{EH}} \left| / \right|p_{\text{cgEH}} \left| \cdot \right|p_{\text{baEH}} } \right)\left| \cdot \right|n_{\text{tEH}} ,\quad {\text{and}} $$
(5)
$$ N_{\text{AdEH}} = \left( {{{xc_{\text{EH}} } \mathord{\left/ {\vphantom {{xc_{\text{EH}} } {p_{\text{cgEH}} p_{\text{baEH}} }}} \right. \kern-\nulldelimiterspace} {p_{\text{cgEH}} p_{\text{baEH}} }}} \right)n_{\text{tEH}} . $$
(6)

P-boxes were specified and convolved using RAMAS Risc Calc 4.0 (Ferson 2002). Distributions with infinite tails, such as the Normal, which was used for x, cannot be convolved in p-bounds analysis, and therefore truncation was enforced at 0.005 and 0.995 cumulative probability. Truncation was not necessary for the distributions with finite bounds (i.e., minimum–maximum and empirical histogram). Empirical histogram bounds were constructed using Kolmogorov–Smirnov confidence limits of 95 %. The mathematics behind convolving p-boxes is described elsewhere (Frank et al. 1987; Williamson and Downs 1990).

Copulae were used to convolve distributions in solving Eqs. (2)–(6). A copula is a function that describes the dependence relationship between multiple variables by transforming the marginal distributions of each variable to uniform distributions. This works because any variable in the model, y, can be represented by a generalised inverse, v = F −1(u), where F −1 is an inverse CDF and u is a uniformly distributed random variable. Thus a copula function, C, is defined as a function, f, of the generalised inverses \( U = u_{1} , \ldots ,u_{d} \) of d variables, \( Y = y_{1} , \ldots ,y_{d} \), in the model so that

$$ C(U) = f\left( {F_{1}^{ - 1} (u_{1} ), \ldots ,F_{d}^{ - 1} (u_{d} )} \right). $$
(7)

Independence is a special copula function. Where independence is assumed \( C(U) = \prod\nolimits_{i = 1}^{d} {u_{1} } , \) which is identical to the form of Eq. (1). Where no assumptions were made about dependencies, Ferson’s et al. (2004) approach of convolving p-boxes within Fréchet bounds was used. Let

$$ C^{d} (U) = \left( {\sum\limits_{i = 1}^{d} { - C(U)} } \right), $$
(8)

where C(U) is a copula fitting within the lower Fréchet copula bounds

$$ C^{F} (U) = \max \left( {\left( {\sum\limits_{i = 1}^{d} {u_{i} } } \right) - 1,0} \right). $$
(9)

Thus both the lower and upper bounds of the dependency function governing Y enclose a copula describing the dependencies in terms of the general inverse U. Both the upper and the lower bounds enclosing this copula can be expressed in terms of the Fréchet lower bounds, which can then be used to elucidate the nature of the dependency function of Y (Ferson et al. 2004). All possible copulae describing Y are enclosed within the Fréchet bounds such that no assumptions about dependency need be made. Kendall’s grade correlation was then used to describe the nature of the dependencies identified through this process (Ferson et al. 2004).

Results

The Monte Carlo-LHS model yielded a median estimate of 6.1 million species, with a 90 % confidence interval of [3.6, 11.4] million (Fig. 2a). Simply replacing the assumption of pure statistical independence between variables in the model with no dependency assumptions resulted in reasonably broad p-bounds (Fig. 2b), with lower and upper bounds at 0.5 cumulative probability (i.e., at the median estimate) of 2.9–12.7 million. Bounding the input variables had an even larger effect on the bounds for the prediction. In the case of min–max bounds, the probability envelope was so wide and steep that there was negligible difference with respect to the dependency assumptions, with pure independence and no dependency assumptions yielding bounds at 0.5 cumulative probability of 2.35–19.7 million and 2.4–20.0 million, respectively (Fig. 2c, d). Furthermore, the shapes of the bounds were almost identical in both these cases. The use of empirical bounds on the input variables had negligible impact relative to the min–max bounds, and, in fact, for the case of no dependency assumptions, the lower and upper bounds at 0.5 cumulative probability were identical (2.4–20.0 million) to those for the parallel min–max case, but the bounds did vary slightly in shape from those produced from the min–max model (Fig. 2f). Likewise, under the assumptions of pure statistical independence the empirical bounding approach produced bounds of slightly different shape to the min–max model, but the values at 0.5 cumulative probability were very similar (2.7–18.4 million) (Fig. 2e).

Fig. 2
figure 2

a Cumulative distribution function for the original Monte Carlo estimator N Ai (dotted curved line) and associated 5th and 95th confidence limits (dotted vertical lines). The filled circle marks the median. bf Probability bounds (solid lines) for the estimation of tropical arthropod species richness using the following estimators: b N Ad the original estimator but with no dependency assumptions; c N AiMM all variables assumed to be statistically independent from each other and represented with min–max p-boxes; d N AdMM no dependency assumptions, min–max p-boxes; e N AiEH statistical independence; empirical histogram p-boxes; f N AdEH no dependency assumptions, empirical histogram p-boxes. The dotted line in each plot represents the cumulative distribution function for N Ai

Discussion

Probability bounds analysis was used to assess the plausibility of a Monte Carlo model of tropical arthropod species richness. While broad, the bounds rule out the possibility of estimates of 30 million species or greater, with the 100th percentile of the right-hand bound—i.e., the absolutely most conservative scenario—being <30 million in each case. P-bounds define the cumulative probability space in which the true distribution will lie, but it is important to note that each of the infinite number of CDFs within this space is not equally likely. In fact, it could be argued that this approach is markedly too conservative, as it even allows for highly unlikely distribution forms, such as multimodal, that are probably inappropriate for the parameters in this model (or indeed the prediction). Interestingly, the CDF of the Monte Carlo model was always situated toward the left-hand side of cumulative probability space defined by the p-bounds, regardless of the dependency or distributional form assumptions made in assigning these bounds. The reason for this is unknown.

Removing the assumption of independence between variables in the original model and replacing it with no dependency assumptions resulted in reasonably broad p-bounds (Fig. 1b). While the state of knowledge about the nature of dependencies between the model variables is very poor, it is reasonable to expect that some dependencies will indeed exist. For example, the richness and specialisation of insects is not independent of tropical tree species richness, as illustrated through the Janzen–Connell hypothesis (Janzen 1970; Connell 1971), but it is difficult to know the strength of the relationship because the relative contributions of phylogenetic conservativeness and geographic contingency and local mass effects in the assemblage of communities remain unclear (Goßner et al. 2009). It would also be reasonable to hypothesise that the proportion of arthropods—including beetles—that are herbivores is likely to be dependent upon plant species richness. On the whole, plant-feeding arthropods are more specialised and constrained in the diversity of their resource use than non-plant-feeding species, such as carnivores and fungivores (but not parasitoids) (Ross et al. 1982). It could be that this is because dealing with plant physical and chemical herbivore deterrents is more digestively demanding and thus requires a more specialised digestive system. Therefore, it could be hypothesised that over time increasing tree diversity would likely lead to more herbivores and alter the ratio of herbivores to non-herbivores. The relationship between tree species richness and the ratio of herbivores to non-herbivores logically leads to the possibility of secondary dependencies with the canopy to ground ratio, and the proportion of non-beetle arthropods.

The canopy to ground ratio may be dependent upon plant species diversity if there are more plant-feeding arthropods in the canopy (e.g., Grimbacher and Stork 2007). Arthropods associated with the ground are likely to show lower levels of specialisation (e.g., Crutsinger et al. 2008; Donoso et al. 2010). Thus, if there are more plant-feeding arthropods in the canopy, then over evolutionary time an increase in tree species richness might be expected to lead to increasing arthropod richness in the canopy at a greater rate, relative to the ground.

For similar reasons, it may be that the proportion of non-beetle arthropods found in the canopy relative to the ground is related to tree species richness. Unlike the Coleoptera, which are highly diverse in their feeding ecology, species from most insect orders are likely to have one main mode of feeding (Ross et al. 1982). The Lepidoptera, for example, are predominantly herbivorous, while the non-ant Hymenoptera are largely predatory or parasitoid. Because not all arthropod orders contribute equally to global species richness (Nielsen and Mound 2000), and herbivores are likely to have a much tighter association to tree species richness than non-herbivores, the ratio of canopy to ground diversity is likely to alter the relative contribution of non-beetle arthropods.

Of course, these are just some examples of potential dependencies between variables typically used in an extrapolation model of species richness. The exact form of these dependencies is unknown; there are likely to be other dependencies, including those of higher-order. The theoretical arguments given above cover only some processes that could influence dependencies—there may indeed be other processes negating, antagonising, or complementing these. It is for these reasons that Fréchet’s (1935) copula was used to convolve the distributions, as it makes no assumptions about the nature of the dependencies. Other copulae can be used to specify other dependencies, such as perfect, opposite, positive, negative, straight-positive, and straight-negative, and these can be implemented in RiscCalc (Ferson et al. 2004). With further ecological and evolutionary insight into the nature of the dependencies, these less conservative copulae could be used in such models. But before any gains are to be made in narrowing the bounds through this means, the more problematic issue of uncertainty associated with the model variables needs to be addressed, as discussed in the original manuscript (Hamilton et al. 2010). This had a larger effect on the breadth of the bounds than the independence assumption (Fig. 2b, c, e).

Since Erwin presented the extrapolation approach to estimating tropical arthropod species richness, overwhelmingly the debate has centred around what values best represent the variables (Erwin 1988; Thomas 1990; Stork 1988, 1993; Ødegaard 2000), but this thinking needs to be broadened to consider the relevance of potential dependencies between variables and the relative merits of different technical approaches to uncertainty modelling in this context. Furthermore, other methods of estimating tropical arthropod species richness would benefit from more thorough use of uncertainty modelling, including, inter alia, extrapolations from known faunas and regions, methods using ecological models, eliciting taxonomists’ views, and species description rates (Stork 1993). Mora et al. (2011) recently made a step in this direction through accommodating uncertainty in their taxonomic-level based global species richness model, which produced a median estimate of 8.7 million eukaryotic organisms on Earth (±1.3 million SE), with 6.5 million of these being terrestrial, which accords well with our original median estimate of 6.1 million tropical arthropods (Hamilton et al. 2010, 2011). Another recently published model (Costello et al. 2012), based on species description rates, also accounted for uncertainty, and produced a median estimate of 490,960 (95 % CI = 449,010, 477,990) terrestrial species remaining to be described, which equates to only 1.6–1.7 million terrestrial species existing globally, a much lower prediction than that of Mora et al. (2011) and Hamilton et al. (2010, 2011). The variation in such models highlights the need to consider uncertainty surrounding this important question even more broadly, that is, not just within models but between models. While tropical arthropods species are of primary interest on a global scale, given their high richness and the potentially huge numbers of undescribed species, improved uncertainty modelling can contribute also to other large-scale species richness estimates, be it European marine species (Wilson and Costello 2005) or flowering plants globally (Joppa et al. 2011). Every statistical approach has something to offer but equally has its limitations; the next step in tackling this important question will be to combine models and their associated uncertainties, perhaps using techniques such as Bayesian modelling averaging.