1 Introduction

The bootstrap, introduced by Efron (1979), is a technique for determining the accuracy of statistics in circumstances in which confidence intervals cannot be obtained analytically or when an approximation based on the limit distribution is not satisfactory (Efron and Tibshirani 1993; Davison and Hinkley 1997). Bootstrap techniques have become very popular in many areas of the environmental sciences, including frequency analysis in climatology and hydrology (Dunn 2001; Hall et al. 2004; Ames 2006; Kyselý and Beranová 2009; Twardosz 2009; Fowler and Ekström 2009). There are two basic approaches to the bootstrap. The nonparametric bootstrap is based on resampling with replacement from a given sample and calculating the required statistic from a large number of repeated samples (it is often termed simply ‘resampling’); the parametric bootstrap instead generates random samples from a parametric model (distribution) fitted to the data and calculates the statistic from a large number of randomly drawn samples. In both cases, one attempts to infer the distribution of the estimate of a given statistic (e.g., a model parameter or a quantile of a distribution) from the available data.
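The distinction between the two approaches can be sketched in code. The snippet below is a minimal illustration only: the data are hypothetical, scipy's maximum-likelihood fit is used in place of the L-moment estimation adopted later in the paper, and the number of iterations is reduced.

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(42)

# Hypothetical sample; scipy's shape parameter c matches Hosking's k
# (c < 0 corresponds to a heavy upper tail).
sample = genextreme.rvs(-0.2, loc=30.0, scale=10.0, size=40, random_state=rng)

def return_level(params, T=100):
    """T-year return level, i.e., the (1 - 1/T) quantile of the fitted GEV."""
    return genextreme.ppf(1.0 - 1.0 / T, *params)

fit = genextreme.fit(sample)       # MLE fit to the observed sample
B = 200                            # kept small here; the paper uses 1,000
np_est, p_est = [], []
for _ in range(B):
    # Nonparametric: resample the data with replacement and refit.
    res = rng.choice(sample, size=sample.size, replace=True)
    np_est.append(return_level(genextreme.fit(res)))
    # Parametric: draw a fresh sample from the fitted model and refit.
    sim = genextreme.rvs(*fit, size=sample.size, random_state=rng)
    p_est.append(return_level(genextreme.fit(sim)))

# 90% percentile confidence intervals for the 100-year return level.
ci_np = np.percentile(np_est, [5, 95])
ci_p = np.percentile(p_est, [5, 95])
```

In both variants the percentile CI is read directly off the bootstrap distribution of the estimated return level; only the way the repeated samples are generated differs.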

The nonparametric bootstrap is often applied when estimating uncertainties involved in frequency models, as a simple and intuitive first guess. It has been examined in simulation experiments that evaluated the utility of the methods (Hall et al. 2004; Ames 2006) and widely applied in analyses of observed datasets as well as model outputs. However, when the data samples are small, their distributions are skewed, and a suitable parametric model can be assumed (which is the usual case in a frequency analysis of precipitation amounts), the parametric approach to the bootstrap may be advantageous. Kyselý (2008) quantified the performance of nonparametric and parametric bootstraps for several frequency models used in extreme value analysis, concluding that the parametric bootstrap should be preferred in most cases, more markedly so for heavy-tailed distributions (typical of precipitation amounts) than for light-tailed distributions (typical of air temperature data). Nevertheless, that study evaluated the performance of the two bootstrap methods for only one value of the tail index (shape parameter) of heavy-tailed distributions, and possible dependence on the tail behavior was not examined.

Herein, we analyze the performance of nonparametric (NP) and parametric (P) bootstraps in simulation experiments for a wide range of heavy-tailed distributions used for modeling, among other things, probabilities of extreme precipitation. Heavy-tailed distributions are not exponentially bounded, i.e., they have heavier upper tails than do exponential-like distributions. As a consequence, when data follow a heavy-tailed distribution, design values corresponding, e.g., to 50- or 100-year return levels may be severely underestimated if the heavy tails are not correctly represented in the model used for the estimation (the limiting Gumbel distribution for maxima, sometimes applied also for modeling precipitation extremes, is not heavy-tailed). There is a consensus that extremes of some environmental variables are heavy-tailed (Katz et al. 2002), including maxima of precipitation amounts (Buishand 1991; Egozcue and Ramis 2001; Kyselý and Picek 2007) and streamflow (Farquharson et al. 1992; Anderson and Meerschaert 1998; Kochanek et al. 2008), but also less common variables such as sedimentation rates in lakes (which are sensitive to extreme precipitation; Lamoureux 2000).

The paper is organized as follows: in Section 2, the methodology and settings of the simulation experiments are given. Differences between coverage probabilities of confidence intervals obtained with the NP and P bootstraps are quantified in Section 3, and their dependence on the tail index and sample size is evaluated. An application of the two bootstrap methods to confidence intervals of high quantiles of observed precipitation data is shown in Section 4, and implications for use of the bootstrap confidence intervals in heavy-tailed frequency models are discussed in Section 5.

2 Methods

Simulation experiments are carried out with a number of combinations of true (parent) and fitted probability distributions. The parent distribution is that from which random artificial data samples of a specified size are drawn in the first step of each experiment; the fitted distribution is the one that is adopted for the estimation in the artificial data. Analogously to Kyselý (2008), the size of the artificial samples n was set to 20, 40, 60, and 100 in each experiment to span a range of values typical of time windows for which climatological datasets are analyzed.

2.1 Fitted model

The generalized extreme value (GEV) distribution (Appendix 1) is applied as the fitted distribution in most experiments. It includes three models for maxima of asymptotically large samples (Gnedenko 1943), and it is widely used in frequency modeling of heavy precipitation (Semmler and Jacob 2004; Gaál et al. 2008; Overeem et al. 2008; Fowler and Ekström 2009), air temperature (Kharin and Zwiers 2000, 2005; Kyselý 2002; Khaliq et al. 2005), low streamflow (Onoz and Bayazit 1999; Kroll and Vogel 2002; Hewa et al. 2007), floods (Martins and Stedinger 2000; Kumar and Chatterjee 2005; Cunderlik and Ouarda 2007), durations of wet and dry spells (Kharin and Zwiers 2000; Voss et al. 2002; Lana et al. 2006), wind speed (van den Brink et al. 2004), and other variables.

Additional experiments with the generalized Pareto (GP) distribution (Appendix 2) as the parent and fitted distribution are carried out in order to highlight general tendencies for heavy-tailed data. The GP distribution is useful in the ‘peaks-over-threshold’ (POT) method for modeling excesses above a sufficiently high threshold. Such an approach is preferred when whole time series of data are available, owing to the increase in the amount of data entering the estimation procedure. Applications of the POT method include the frequency analysis of air temperatures (Brabson and Palutikof 2002; Katsoulis and Hatzianastassiou 2005; Kyselý et al. 2008), precipitation amounts (Begueria and Vicente-Serrano 2006; Bacro and Chaouche 2006; Kyselý and Beranová 2009), floods (Adamowski 2000; Cox et al. 2002; Prudhomme et al. 2003), dry spells (Lana et al. 2006), wind speeds (Dupuis and Field 2004; An and Pandey 2005), and wave heights (Pandey et al. 2004).

2.2 Description of the parent models and the simulation experiments

The settings of the individual simulation experiments (denoted E1 to E4) are summarized in Table 1. Note that the parameterization used throughout the paper is in agreement with Hosking and Wallis (1997), i.e., shape parameter k < 0 corresponds to a heavy-tailed distribution (cf. Appendices 1 to 3).

Table 1 Summary of settings of the simulation experiments

In experiments E1, the GEV distribution is used as the parent as well as the fitted distribution. This combination of the true and fitted model represents the case when a correct parametric model is adopted for the examined samples. The tail behavior of the GEV distribution is governed by the shape parameter k, and the choice of the other two parameters (location and scale) could be rather arbitrary; their values (given in Table 1) reflect typical distributions of daily maxima of precipitation amounts (in mm) over some land areas in mid-latitudes. We examine four GEV distributions as parent, with values of k ranging from −0.4 (a pronounced heavy tail) to −0.1 (in which case the tail behavior differs little from the limiting Gumbel case). Such a range of the tail index covers typical values estimated, for example, for distributions of annual maxima of 1-day and multi-day rainfall amounts in central Europe (Kyselý and Picek 2007). Although tails heavier than k = −0.4 may sometimes be found in practical applications, too, we note that already for GEV distributions with k ≤ −0.33 (k ≤ −0.25) the third (fourth) statistical moment does not exist (Hosking and Wallis 1997). The bias in the estimates of k and high quantiles from samples drawn from such heavy-tailed distributions becomes more important (cf. the bias of the shape parameter and the 100-year return values in experiments with k = −0.4 in Tables 2 and 3), which also makes the application and comparison of bootstrap confidence intervals for k ≲ −0.4 less straightforward.

Table 2 True (parent) values and mean estimated values of the shape parameter (k) for individual experiments and sample sizes (n)
Table 3 True (parent) values and mean estimated values of the 100-year return level for individual experiments and sample sizes (n)

Other experiments that make use of a correct parametric model for the estimation are E2, in which the GP distribution is the true as well as the fitted distribution. The shape parameter k is analogous in the GP and GEV distributions, so the same set of values for k is used as in E1. The location parameter is usually known in applications of the POT method and equals zero (when the threshold that delimits extremes is set), so the two-parameter version of the GP distribution is adopted (Appendix 2). In order to allow for a straightforward interpretation, return levels are inverted from the estimated quantile function under the assumption that the frequency of exceedances is one per year, i.e., the same as in the conventional ‘annual maxima’ method with the GEV distribution. This latter setting corresponds to the value of the mean exceedance rate in the GP-Poisson process model (Coles 2001) that is equal to 1.
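Under this convention (threshold at zero, one exceedance per year on average), converting the fitted GP model to T-year return levels is a one-line computation. A sketch in Hosking's parameterization, where `alpha` is the scale, `k` the shape parameter, and `rate` the mean number of exceedances per year:

```python
import math

def gp_return_level(T, alpha, k, rate=1.0):
    """T-year return level of the two-parameter generalized Pareto model
    x(F) = alpha * (1 - (1 - F)**k) / k (Hosking's parameterization,
    k < 0 for a heavy tail), assuming `rate` threshold exceedances per year."""
    F = 1.0 - 1.0 / (rate * T)   # non-exceedance probability of the T-year event
    return alpha * (1.0 - (1.0 - F) ** k) / k
```

As k approaches zero this reduces to the exponential case, alpha * ln(rate * T).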

The other two sets of experiments, E3 and E4, are carried out to demonstrate differences between the bootstrap confidence intervals when false parametric models are assumed and adopted for the examined samples. This condition may be common in applications in which the true (parent) distributions are unknown, although the appropriate model may be selected according to goodness-of-fit tests; the probability of selecting a false distribution for given data increases with decreasing sample size.

In experiment E3, the generalized logistic (GLO) distribution (Appendix 3) is used as the parent but the GEV distribution is utilized for the estimation. The GLO is a model that has become popular in hydrology following the study of floods by Ahmad et al. (1988), and it has also been found to be a useful distribution for maxima of precipitation amounts (Shoukri et al. 1988; Asquith 1998; Lee and Maeng 2003; Fitzgerald 2005; Kyselý and Picek 2007; Zin et al. 2009). It has been recommended as the standard for flood frequency analysis in the UK by the Flood Estimation Handbook (IH 1999). We use a reparameterized version of the log-logistic distribution of Ahmad et al. (1988), in which the parameters are analogous to those of the GEV distribution (Hosking and Wallis 1997; Appendix 3). Since the GEV and GLO distributions are closely related models with the same weight of the upper tails, the setting of experiments E3 means that a false but related parametric model is adopted for the estimation. The same set of parameters as in E1 is used for the GLO distribution, with the shape parameter k again varying between −0.4 and −0.1. Note that high quantiles of the GEV and GLO distributions with the same parameters are very similar (Tables 2 and 3).

In the last set of experiments, E4, the samples are drawn from a double-populated GEV–GLO model, i.e., a mixture of two distributions (Fig. 1). Three quarters of data in each artificial sample are drawn from the GEV distribution while the remaining quarter originates from the GLO distribution. This may represent a condition when two mechanisms producing extremes—for example heavy precipitation—are present in a sample: most extremes arise from an ‘ordinary’ population but less frequently there also occur events from a secondary ‘extra-ordinary’ population (cf. van den Brink et al. 2004). A relatively large fraction of data from the secondary population (25%) is chosen in order to highlight the differences from experiments E1, since for decreasing fractions the results converge to those of E1. The GLO distribution with a shifted location and a heavy upper tail (k = −0.4) is used to represent the secondary population. The model parameters (which again span a range of values for the tail behavior of the primary GEV distribution) are summarized in Table 1, and the probability density functions of the mixed models together with both components are plotted in Fig. 1. The GEV distribution is again adopted as a model for the data. The setting of experiments E4 corresponds to the case when a simplified model is fitted to an examined sample.
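Drawing from the double-populated model amounts to inverse-CDF sampling from each component. A sketch assuming Hosking's parameterizations of the GEV and GLO quantile functions; the parameter values below are placeholders, not the exact values of Table 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def gev_inv(u, xi, alpha, k):
    """GEV quantile function, Hosking's parameterization (k != 0)."""
    return xi + alpha * (1.0 - (-np.log(u)) ** k) / k

def glo_inv(u, xi, alpha, k):
    """Generalized logistic quantile function, Hosking's parameterization."""
    return xi + alpha * (1.0 - ((1.0 - u) / u) ** k) / k

def mixture_sample(n, gev_par, glo_par, frac=0.25):
    """Draw n values: (1 - frac) of the sample from the GEV component and
    frac from the GLO component, as in experiments E4."""
    n_glo = int(round(n * frac))
    u_gev = rng.uniform(size=n - n_glo)
    u_glo = rng.uniform(size=n_glo)
    return np.concatenate([gev_inv(u_gev, *gev_par),
                           glo_inv(u_glo, *glo_par)])

# Placeholder parameters: a primary GEV and a shifted, very heavy-tailed
# (k = -0.4) GLO secondary population.
sample = mixture_sample(100, (30.0, 10.0, -0.2), (60.0, 10.0, -0.4))
```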

Fig. 1 Probability density functions of the two partial distributions and the double-populated parent models in experiments E4

2.3 Other settings of the simulation procedure

The simulation procedure in each experiment and each combination of sample size (n = 20, 40, 60, and 100) and tail behavior (governed by k) is as follows (Kyselý 2008):

  1. Five thousand artificial samples of n values are randomly drawn from the specified parent distribution (or the mixture of the parent distributions in E4).

  2. To each artificial sample, the GEV or GP distribution is fitted and its quantiles corresponding to the return levels of 5 to 200 years are estimated.

  3. The 90% and 95% confidence intervals (CIs) of the model parameters and quantiles are estimated from the P and NP bootstraps. The former involves generating a large number of random samples from the fitted distribution (with parameters estimated from the artificial sample); the latter consists of simple resampling with replacement from the artificial sample.

We confine our attention in this study to the most widely-used percentile CIs. For both bootstrap approaches and all artificial samples, 1,000 iterations are carried out to estimate 2.5%, 5%, 95%, and 97.5% quantiles of distributions of the 5- to 200-year return levels, which delimit the 90% and 95% CIs. The method of L-moments (Hosking 1990) is used for estimating the parameters and quantiles of the GEV/GP distribution.
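The L-moment fitting step for the GEV distribution can be sketched as follows, using unbiased sample L-moments (Hosking 1990) and the standard rational approximation for the shape parameter (Hosking and Wallis 1997). This is an illustration, not the exact code of the experiments:

```python
import math
import numpy as np

def sample_lmoments(x):
    """Unbiased sample L-moments lambda1, lambda2 and the ratio tau3 = l3/l2."""
    x = np.sort(np.asarray(x, dtype=float))
    n = x.size
    j = np.arange(n)
    b0 = x.mean()
    b1 = np.sum(j / (n - 1) * x) / n
    b2 = np.sum(j * (j - 1) / ((n - 1) * (n - 2)) * x) / n
    l1, l2, l3 = b0, 2 * b1 - b0, 6 * b2 - 6 * b1 + b0
    return l1, l2, l3 / l2

def gev_fit_lmom(x):
    """GEV parameters (xi, alpha, k) in Hosking's parameterization
    (k < 0 = heavy tail), via the rational approximation for k."""
    l1, l2, t3 = sample_lmoments(x)
    c = 2.0 / (3.0 + t3) - math.log(2.0) / math.log(3.0)
    k = 7.8590 * c + 2.9554 * c ** 2
    alpha = l2 * k / ((1.0 - 2.0 ** (-k)) * math.gamma(1.0 + k))
    xi = l1 - alpha * (1.0 - math.gamma(1.0 + k)) / k
    return xi, alpha, k

def gev_quantile(F, xi, alpha, k):
    """Quantile function; the T-year return level is gev_quantile(1 - 1/T, ...)."""
    return xi + alpha * (1.0 - (-math.log(F)) ** k) / k
```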

The performance of the NP and P bootstraps is evaluated in terms of empirical coverage probability of the CIs, i.e., the percentage of simulated results for which the estimated 90% and 95% CIs cover the true values of the quantiles (which are determined from parameters of the parent distribution). It is expected that an appropriate (‘correct’) method yields coverage close to the nominal value of 90/95% while a higher (lower) value points to CIs that are too wide (narrow) compared to the real uncertainty, provided that the quantile estimates are not biased.
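The coverage computation itself is a simple count over the simulated CIs. A sketch, where `lower` and `upper` are hypothetical arrays of bootstrap CI bounds (one entry per artificial sample) and `true_value` is the quantile computed from the parent distribution:

```python
import numpy as np

def coverage(lower, upper, true_value):
    """Empirical coverage probability: the fraction of estimated CIs
    that contain the true value of the quantile."""
    lower = np.asarray(lower)
    upper = np.asarray(upper)
    return float(np.mean((lower <= true_value) & (true_value <= upper)))
```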

3 Results

3.1 Experiments E1 (GEV fitted to GEV-distributed data)

For all values of the shape parameter k and all examined sample sizes (n = 20, 40, 60, and 100), the P bootstrap performs considerably better in terms of the coverage probability of the CIs (Fig. 2). The differences are particularly important for small and moderate sample sizes (n = 20, 40) and for the very heavy-tailed GEV model (k = −0.4). For example, the 90% CIs of the 100-year return values estimated from samples with 40 members drawn from the GEV distribution with k = −0.4 cover the true value in only 64% of cases when the NP bootstrap is used. This improves to 81% for the P bootstrap (a value still lower than the expected 90%). The coverage probabilities of the 90% and 95% CIs of the 100-year return levels are summarized for all experimental settings in Table 4; the findings are analogous regardless of whether the 90% or 95% CIs are considered.

Fig. 2 Dependence of the coverage probability of the 90% CIs from the parametric (P) and nonparametric (NP) bootstraps on the T-year return level (T = 5 to 200) in experiments E1, for individual sample sizes (columns) and values of the tail index (rows)

Table 4 Coverage probabilities of the 90% and 95% CIs of the 100-year return levels estimated from the nonparametric (NP) and parametric (P) bootstraps. k denotes the shape parameter of the parent distribution, n stands for the sample size

It should be noted that the coverage probability of the 90% CIs for all values of k, n, and in the whole range of the 5- to 200-year return levels is lower than the nominal value of 90% for both the NP and P bootstraps (Fig. 2). This means that the CIs constructed using the bootstrap are always too narrow and undervalue the uncertainty involved in the estimates. However, this underestimation is much less severe when the P version of the bootstrap is employed.

Another favorable property of the P bootstrap is that the coverage probability of the CIs is almost independent of the return level. Except for very small samples (n = 20) drawn from the GEV distribution with k ≤ −0.3, the coverage probability of the 90% CIs constructed by means of the P bootstrap is at least 80% for all return levels, and it is close to the nominal value of 90% for moderate and large sample sizes and less heavy-tailed distributions (Fig. 2).

3.2 Experiments E2 (GP fitted to GP-distributed data)

Similar results are achieved in simulation experiments E2, in which the GP distribution is fitted to GP-distributed data (Table 4, Fig. 3). Differences between the NP and P bootstraps are slightly less pronounced, but the general tendencies remain unchanged: the P bootstrap always performs better; CIs from both the P and NP bootstraps always have coverage below the nominal level; the improvement of the P over the NP bootstrap is particularly important for very heavy-tailed data and small sample sizes; and the coverage probability of the 90% CIs of quantiles corresponding to the 5- to 200-year return levels from the P bootstrap is at least 80% except for n = 20 and k ≤ −0.3.

Fig. 3 Same as in Fig. 2 except for experiments E2

Differences in coverage probabilities of CIs of high quantiles between the experiments with the GEV and GP distributions are relatively minor for the P bootstrap, which reflects the fact that the shape parameter is analogous in the two distributions. Some differences between the behavior of the coverage probabilities in the two experiments are related to the fact that the GP distribution is estimated as a two-parameter model (Appendix 2), with the location defined by the fixed threshold also in practical applications, and to a slightly different bias of the estimates in the two models. The fact that the number of free parameters is smaller and the skewness of the data sample (l3/l2) is not employed in estimating the GP distribution (in contrast to the GEV and GLO—Appendices 1 and 3) is manifested in a tendency toward a larger positive bias of the estimates of the shape parameter, particularly for small samples (Table 2).

3.3 Experiments E3 (GEV fitted to GLO-distributed data)

The performance of both bootstrap methods deteriorates in experiment E3 (a false model fitted to heavy-tailed data) compared with experiments E1 and E2, and the coverage probability becomes particularly low for high quantiles (Fig. 4). However, the P bootstrap performs better in most cases even though the parametric model assumed for the data is misspecified. Only for the combination of large sample sizes (n = 60, 100) and a weakly pronounced heavy tail (k = −0.1) does the NP bootstrap outperform the P bootstrap. On the other hand, the superiority of the P bootstrap is obvious even for large sample sizes (n = 100) with pronounced heavy tails.

Fig. 4 Same as in Fig. 2 except for experiments E3

It should be emphasized that the coverage probability of the 90% (95%) CIs for 100-year return levels, except for large samples (n = 100), does not exceed 79% (85%) for the P bootstrap and 72% (76%) for the NP bootstrap (Table 4). That is, the real uncertainties of the estimates are always substantially underestimated. The low coverage is related to the bias of the estimated model, and the increasing positive bias of k (also manifested in a negative bias of the high quantiles) for less heavy-tailed parent distributions (Tables 2 and 3) may explain the worse performance of the P bootstrap for less-pronounced heavy tails.

3.4 Experiments E4 (GEV fitted to double-populated GEV–GLO data)

The P bootstrap outperforms the NP bootstrap in all settings of experiments E4, in which a simplified (GEV) model is applied to double-populated GEV–GLO data (Fig. 5). As in experiments E1 and E2, the differences decrease with increasing sample size but are still evident for n = 100. The coverage probability of the 90% (95%) CIs for the 100-year return levels, except for large samples (n = 100), does not exceed 72% (77%) for the NP bootstrap (Table 4). The performance of the NP bootstrap is particularly poor for very small samples (n = 20), for which the coverage probability of the 90% CIs from the NP bootstrap is between 50% and 60% for return levels T ≥ 50 years (Fig. 5). The coverage is improved considerably with the P bootstrap (75–80%).

Fig. 5 Same as in Fig. 2 except for experiments E4

With increasing k (towards less heavy tails) of the parent GEV distribution, the coverage probability of the CIs from the P bootstrap deteriorates for high quantiles as the two parent distributions become more dissimilar (with respect to the tail behavior) and the fitted model less appropriate; the two populations that produce the samples are not differentiated in the fitted model. Another feature related to the bias of the adopted model is that for less-pronounced heavy tails, there is little improvement in the coverage probability of the CIs with increasing sample size for the P bootstrap (unlike the NP bootstrap; bottom row of Fig. 5). Nevertheless, these are not arguments against the P bootstrap: the NP bootstrap always performs worse, and a sample size of n = 100 may be large enough to recognize in a practical application that the single-population GEV model is not suitable for such data.

4 Application to observed precipitation data

To demonstrate differences between application of the NP and P bootstraps to real climatological data, we compare CIs for high quantiles of precipitation amounts estimated by the two bootstrap approaches. The examined dataset consists of annual maxima of 1- and 5-day precipitation amounts measured at 175 rain-gauge stations covering the area of the Czech Republic, with complete series over 1961–2005. The spatial distribution of the stations is shown in Fig. 6. The dataset originates from Kyselý (2009), who examined trends in characteristics of heavy precipitation in individual seasons, and it is superior in terms of spatial coverage and data quality to the one used in a previous study on statistical modeling of precipitation extremes in this area (Kyselý and Picek 2007). The assumption of stationarity of the examined data was checked before application of the extreme value analysis: trends significant at p = 0.05 (according to the Mann–Kendall test) were observed at 5.7% (3.4%) of stations for annual maxima of 1-day (5-day) precipitation amounts, i.e., the percentage of significant trends at the given level is close to the nominal value of 5% in both cases.

Fig. 6 Spatial distribution of the 175 stations with precipitation data over 1961–2005

The GEV distribution was fitted to the individual stations’ datasets using the method of L-moments, and both bootstrap approaches were used to estimate the 90% CIs of model parameters and quantiles corresponding to the return levels of 10, 20, 50, and 100 years. The number of repetitions in both NP and P bootstraps was set to 1,500.

Figure 7 shows scatter-plots of the relative width of the estimated 90% CIs against the shape parameter for individual return levels (the relative width of the CIs, i.e., the width of the CIs scaled by the value of the quantile corresponding to the return level, is plotted in order to remove variations related to the magnitude of the quantile itself, e.g., larger values at mountain stations). Although the range of the estimated values of the shape parameter is wide, the estimated GEV distribution is heavy-tailed at a large majority of the stations (151/156 for 1-day/5-day maxima).

Fig. 7 Scatter-plots of the relative width of the estimated 90% CIs against the estimated shape parameter for individual return levels (r.l.) and annual maxima of 1-day (top) and 5-day (bottom) precipitation amounts at 175 stations. The width of the CI is scaled by the estimated value corresponding to the given return level at each station

For all return levels, there is a tendency to more liberal (narrower) CIs from the NP bootstrap. The percentage of stations at which the CIs from the NP bootstrap are narrower than those from the P bootstrap is summarized in Table 5, and the average relative widths of the CIs are shown in Table 6. As expected, the differences between the NP and P bootstraps increase with the return level; they are small for 10-year return values but become quite pronounced for 50- and 100-year return values (Table 6, Fig. 7). For 100-year return values of 1-day precipitation amounts, the average relative width of the 90% CIs is 66.3% when the P bootstrap is applied but only 49.9% with the NP bootstrap. These values are averaged over all 175 stations, regardless of whether a heavy-tailed or light-tailed GEV distribution is estimated. If only sites with an estimated heavy tail (k < 0) are considered, the difference is even more pronounced: the average relative width of the 90% CIs is 70.7% for the P bootstrap and 52.2% for the NP bootstrap. Since both the NP and P bootstraps tend to yield CIs that are too narrow for heavy-tailed data, as shown above, the uncertainty of the high quantiles tends to be substantially underestimated when using the NP bootstrap, while the underestimation is at least partly rectified with the P bootstrap.

Table 5 Percentage of stations at which the 90% CIs estimated from the NP bootstrap are narrower than those from the P bootstrap for high quantiles of distributions of observed precipitation data (175 stations, 45 years)
Table 6 Average relative widths of the 90% CIs (in %) estimated from the NP and P bootstrap for high quantiles of distributions of observed precipitation data (175 stations, 45 years). The width of the CI is scaled by the estimated value corresponding to the given return level at each station

Other features of the CIs demonstrated in Fig. 7 include the dependence of the width of the CIs on k and the growing width of the CIs with rising return level (the latter being increasingly important for very heavy-tailed data). Also noteworthy is that the dependence of the width of the CIs on k is close to linear for the P bootstrap, reflected in a much narrower band of points for the P bootstrap than for the NP bootstrap. Owing to the length of the examined precipitation datasets (45 years), sampling variability may also strongly influence the bounds of the CIs estimated with the NP bootstrap (while it is ‘smoothed’ with the P bootstrap). This is manifested, among other things, in some outlying estimates of the relative width of the 90% CIs from the NP bootstrap in Fig. 7, particularly a large outlier in the upper row of the plots (for 1-day maxima) for the 50- and 100-year return levels. Scrutiny of the data reveals that this outlying estimate appears at a station affected by extreme rainfall on July 22, 1998 (resulting in a severe flash flood in eastern Bohemia), with a 24-h precipitation amount of 163 mm, while the second largest daily amount at this site over 1961–2005 was only 83 mm. A bootstrap that consists purely of resampling with replacement of the 45 values of annual maxima puts too much weight on the single extreme observation, and this leads to inflated confidence bounds in this specific case of a heavy-tailed sample. This example demonstrates that estimates based on the NP bootstrap are much more sensitive to random sampling variability and much less consistent between samples (stations in this case) than those obtained with the P bootstrap.

5 Discussion

The study compares the performance of two basic variants of the bootstrap—parametric and nonparametric—for estimating CIs of high quantiles in heavy-tailed data, which are typical of precipitation extremes and some other climatological and hydrological variables. When a correct parametric model is fitted to data drawn from the GEV or the GP distribution, the parametric bootstrap performs considerably better for all examined return levels (5 to 200 years), sample sizes (n = 20, 40, 60, and 100), and tail behaviors (the shape parameter k ranging from −0.4 to −0.1). The parametric bootstrap is preferred also when a false model (GEV) is fitted to GLO-distributed data, except for the distribution with the least heavy tail (k = −0.1) and large sample sizes. Since the probability of selecting an incorrect parametric model (by means of goodness-of-fit tests) declines with increasing sample size, the superiority of the nonparametric bootstrap in this particular case is of little practical importance.

The last-examined experiments make use of a simplified model (GEV) adopted for mixed (double-populated) data drawn from combinations of the GEV and GLO distributions. This may represent a relatively frequent case in extreme value analysis when the samples examined arise from populations governed by different extreme-generating mechanisms (characterized by specific distributions), which are, however, difficult to disaggregate from data records. The coverage probability of the CIs constructed from the parametric bootstrap is always better in the experiments with the mixed models, too, even for large sample sizes (n = 100).

A tendency to more liberal (narrower) CIs from the nonparametric than parametric bootstrap is clearly demonstrated in the application to high quantiles of distributions of observed maxima of 1- and 5-day precipitation amounts, measured at 175 stations over 1961–2005. The differences increase with the return level, and the relative width of the 90% CIs of the 100-year return values of both 1- and 5-day precipitation amounts is reduced on average by 25% when the nonparametric bootstrap is used instead of the parametric bootstrap. This reduction is likely to increase if inference is based on samples from shorter time periods. Another advantage of the CIs from the P bootstrap, demonstrated in the application to real data, is that the estimates are much less influenced by random (sampling) effects.

It should also be stressed that in all the simulation experiments, the constructed CIs are too narrow and too often miss the true values of model parameters and quantiles. This means that the uncertainty of the parameters and quantiles is underestimated, more importantly so for the nonparametric bootstrap. The underestimation appears to be a general feature of bootstrap CIs for heavy-tailed data and is related to skewness in the distributions of estimates of model parameters (Tajvidi 2003). We show that the underestimation of uncertainty is more important

  • For the nonparametric than parametric bootstrap

  • For small sample sizes

  • For higher quantiles (except when the correct model is fitted), and

  • When an incorrect (although related) parametric model is used

This suggests that the bootstrap should be regarded as a first guess of the uncertainty, and alternative methods—e.g., analytical expressions for the sampling variance of quantiles of the distributions (Lu and Stedinger 1992; Kjeldsen and Jones 2004) or likelihood-based confidence intervals (Tajvidi 2003)—should be considered at least for comparison. Inference relying uncritically on the bootstrap may obviously be misleading.

The present simulation experiments examined the behavior of bootstrap CIs for a range of frequency models. Although not all possible cases encountered in analyses of precipitation data can be covered, the simulation results appear to be indicative of some general tendencies of CIs constructed using the bootstrap (as regards the dependence of results on the sample size, tail behavior, and ‘correctness’ of the parametric model). We also confined ourselves to the percentile CIs since these are the most popular; see, e.g., Carpenter and Bithell (2000) or Dixon (2002) for a brief review of advanced versions of bootstrap CIs. The percentile and ‘bias-corrected and accelerated’ (BCa; Efron and Tibshirani 1993) bootstrap CIs were compared by Dupuis and Field (1998) and Kyselý (2008) for the GEV distribution, and by Tajvidi (2003) for the GP distribution; the BCa CIs are usually superior, but their coverage probability is still lower than the nominal value. More sophisticated bootstrap procedures do not compensate for insufficient data, so the poor performance of the nonparametric bootstrap for small sample sizes does not much depend on the variant of bootstrap CIs.

6 Conclusions

The basic choice of bootstrap method (nonparametric vs. parametric) used for estimating uncertainties in frequency models is usually not justified in climatological applications, and the respective limitations and drawbacks of the two bootstraps are not discussed and/or evaluated. We provide arguments for using the parametric version of the bootstrap for constructing quantile confidence intervals in heavy-tailed frequency models, provided that a suitable parametric model is known or can be assumed (which is almost always the case in modeling probabilities of precipitation extremes). Even a moderate misspecification of the distribution does not prevent the parametric bootstrap from performing better than the nonparametric one. Inasmuch as a severe misspecification of the parametric model adopted for examined data is unlikely, provided that the model is supported by goodness-of-fit tests and/or other statistical tools (such as the L-moment ratio diagram; Hosking and Wallis 1997) and the sample's time period is not extremely short, we find it difficult to identify any reasons for using the nonparametric bootstrap. Confidence intervals constructed using the nonparametric bootstrap should be interpreted very cautiously, especially for small and moderate sample sizes and for distributions with very heavy tails, as they may severely undervalue the true uncertainty of the estimates. This is also the reason why the nonparametric bootstrap should be avoided when estimating the uncertainty of design values for use in practical applications.