1 Background

1.1 Practical motivation

The relatively recent increase in the frequency and severity of destructive stormy weather in the UK has stirred renewed interest in the analysis of environmental extremes, practitioners often being motivated by the estimation of the r-year return level—for example, the sea-surge we might expect to see over-topped once, on average, every r years. Structural failure of a sea wall is possible if extreme surges are observed; estimates of the r-year return level are used to inform the design of such structures, and so the accuracy and precision of such estimates are of paramount importance. Recent work in Fawcett and Walshaw (2007, 2012) and Eastoe and Tawn (2012) revealed estimation bias for model parameters, as well as return levels, within a standard peaks over thresholds (POT) framework, in some cases resulting in significant under-estimation of return levels.

Estimation precision is often hampered by a lack of data on extremes; as Davison and Smith (1990) demonstrate, confidence intervals for return level estimates can be so wide that they become practically unusable. Our aim is to exploit fully any quantifiable information on temporal dependence and knowledge of seasonal variability to maximise data usage and estimation precision, whilst avoiding altogether the aforementioned problems associated with POT analyses. Working within the Bayesian framework gives the potential to facilitate these aims still further, enabling any extremal analysis to be augmented through the incorporation of prior information. Estimates of the posterior predictive return level can also give the practitioner a single design parameter estimate within which uncertainty in model estimation and future observations have been properly acknowledged.

1.2 Statistical modelling

Key results in extreme value theory, discussed in detail in Coles (2001), Chap. 3, point to the generalised extreme value (GEV) distribution as a model for block maxima of independent observations, with distribution function (d.f.)

$$ G(y) = \begin{cases} \exp \left[ -\left( 1+\xi (y-\mu )/\varsigma \right) ^{-1/\xi } \right] , & \xi \ne 0 \\ \exp \left[ -\exp \left( -(y-\mu )/\varsigma \right) \right] , & \xi = 0, \end{cases} $$
(1)

defined on \(\{y: 1+\xi (y-\mu )/\varsigma >0\}\), where \(-\infty < \mu < \infty \), \(\varsigma >0\) and \(-\infty < \xi < \infty \) are parameters of location, scale and shape respectively; the case \(\xi =0\) is taken to be the limit as \(\xi \rightarrow 0\). If block maxima have limiting distribution as given by (1), then an alternative characterisation of extremes, in terms of magnitudes of excess over some high threshold u, leads to the generalised Pareto distribution (GPD) with d.f.

$$ H(y) = \begin{cases} 1-\left( 1+\xi y/\sigma \right) ^{-1/\xi } , & \xi \ne 0 \\ 1-\exp \left( -y/\sigma \right) , & \xi = 0, \end{cases} $$
(2)

defined on \(\left\{ y: y>0 \,{\text { and }} \,(1+\xi y/\sigma)>0\right\} \). The parameters of the GPD are uniquely determined by those in the GEV: specifically, the GPD scale \(\sigma =\varsigma + \xi (u-\mu )\). Results in Leadbetter et al. (1983) show that, in the presence of short-term dependence, distributions (1) and (2) will be powered by the extremal index \(\theta \in (0,1]\), a key parameter quantifying this dependence: as \(\theta \rightarrow 0\) we see increasing dependence in the extremes of the process.

The r-year return level \(z_{r}\) can then be obtained by inversion of \(G^{\theta }(z_{r})\) or \(H^{\theta }(z_{r})\). For example, in the case of threshold excesses, on equating to \(1-r^{-1}\) this gives

$$ z_{r} = \begin{cases} u + \sigma \xi ^{-1}\left[ \left( \lambda _{u}^{-1} w_{r} \right) ^{-\xi } - 1\right] , & \xi \ne 0 \\ u - \sigma \log \left( \lambda _{u}^{-1} w_{r} \right) , & \xi = 0, \end{cases} $$
(3)

where \(w_{r}=1-\left[ 1-(rn_{y})^{-1}\right] ^{1/\theta }\), \(\lambda _{u}\) is the rate of threshold excess and \(n_{y}\) is the (average) number of observations per year. An estimate of \(z_{r}\), say \(\hat{z}_{r}\), is usually obtained by replacing \(\sigma \) and \(\xi \) in Eq. (3) with their maximum likelihood estimates \(\hat{\sigma }\) and \(\hat{\xi }\). A typical threshold-based analysis circumvents the estimation of \(\theta \) by fitting the GPD to a set of independent cluster peak excesses; a filtering scheme extracts the single largest observation within each cluster of excesses of u, a cluster terminating once a run of \(\kappa \) consecutive sub-threshold observations is observed. Thus, it is assumed that the extremes being used are independent, giving \(\theta \approx 1\) in (3) and a POT analysis, as referred to in Sect. 1.1.
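
For illustration, a minimal R sketch of Eq. (3) is given below; the function and argument names, and the numerical values in the example call, are placeholders of our own rather than estimates from the analyses reported later.

```r
## Sketch of Eq. (3): the r-year return level from GPD parameters, the
## rate of threshold excess and the extremal index. Default n_y = 2922
## corresponds to 3-hourly observations (365.25 * 8 per year).
return_level <- function(r, u, sigma, xi, lambda_u, theta = 1, n_y = 2922) {
  w_r <- 1 - (1 - 1 / (r * n_y))^(1 / theta)
  if (abs(xi) > 1e-8) {
    u + (sigma / xi) * ((w_r / lambda_u)^(-xi) - 1)
  } else {
    u - sigma * log(w_r / lambda_u)    # the xi = 0 limit
  }
}

## Example call with illustrative values; theta = 1 recovers a POT analysis
return_level(r = 50, u = 0.3, sigma = 0.12, xi = -0.05, lambda_u = 0.05)
```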

To date, no general theory for non-stationary extremes has been established. As Coles (2001), Chap. 6 discusses, ignoring such non-stationarity can lead to bias in estimates of model parameters. In practice, pragmatic solutions have been proposed based on the type of non-stationarity observed. For example, trend can be incorporated through linear modelling of the GEV location parameter. More generally, the extreme value parameters can be written in the form \(h(X^T\beta )\), where h is a specified function, \(\beta \) is a vector of parameters and X is a vector of covariates. Smoothly varying seasonal model parameters, or a simpler seasonal piecewise approach, can also be used to account for seasonal variability (see Sect. 2.2, and Coles (2001), Chap. 6 for more examples; more generally, see Jonathan et al. (2014) for a comprehensive review). In Sect. 3.2 we review recent developments for modelling dependence between extremes which occurs spatially.

1.3 Illustrative applications

Figure 1 (left) shows a series of 3-hourly sea-surges collected at Newlyn, UK (1999–2001 inclusive), and (right) a section of a series of hourly gust wind speed maxima collected at Bradfield, a high altitude location in the UK (1995–2004 inclusive). These plots reveal clear seasonal variability in the wind climate at Bradfield, as well as extremal serial correlation in both datasets. Table 1 (“Block maxima”) shows maximum likelihood estimates for three return levels when fitting to the set of 10 annual maximum wind speeds and the set of 36 monthly sea-surge maxima (with just 3 years of sea-surge data, a block size smaller than the calendar year was required). Also shown (“Threshold excesses”) are the same estimates based on a POT analysis with \(\kappa =10 \,{\text {h}}\) and \(\kappa =30 \,{\text {h}}\) for the wind speeds and sea-surges respectively [\(\kappa =30\) to allow for wave propagation time; see Coles and Tawn (1991)].

Fig. 1

Left-hand side: Newlyn sea-surge data; right-hand side: Bradfield wind speed data. Top: time series plots; bottom: plots of each series against itself at lag 1, with thresholds. The green lines represent the high thresholds used for identifying extremes

Mean excess plots [see Coles (2001), Chap. 4] were used to identify suitably high thresholds. To avoid issues of seasonal variability in the wind speed data, attention was restricted to extremes in the month of January wherein the largest wind speeds occur. In both analyses, the delta method [see, for example, Coles (2001), Chap. 2] has been used to obtain standard errors for \(\hat{z}_{r}\), although, owing to the severe asymmetry of the likelihood surface for return levels, confidence intervals have been obtained after having profiled the likelihood. The gain in precision by using a POT approach is obvious in the analysis of wind speeds. Of course, return level estimates can only be trusted if we have confidence in the fitted model from which we are extrapolating. The standard graphical diagnostics described in Coles (2001), including probability plots and quantile plots (not shown here), indicate suitable fits for both the block maxima and threshold excess analyses summarised in Table 1. In fact, further investigations revealed suitable fits of the GEV / GPD to block maxima / threshold excesses, respectively, across a range of block lengths / cluster termination intervals.

However, Fig. 2 shows the instability of return level estimates for the sea-surge data across different choices of block length / cluster termination interval. Most striking from these plots is the instability of the estimated 95 % confidence upper bounds: in the block maxima analyses this increases by almost 17 m for \(\hat{z}_{1000}\) when increasing the block size from 1 to 2 months; in the POT analyses, similar changes are observed when \(\kappa \) increases from 10 to 26 observations. When a block maxima analysis and a POT analysis both indicate suitable fits, we might then appeal to estimation precision as a reason for adopting the latter. However, the sensitivity of estimates to the choice of declustering interval \(\kappa \) [and to some degree the threshold u itself; see Scarrott and MacDonald (2012)], as illustrated here, should be noted.

1.4 Structure of this paper

The rest of this paper is structured as follows. In Sect. 2 we investigate methods for increasing the precision of return level estimates by considering approaches for pressing all extremes into use. In particular, we consider threshold-based alternatives to POT via explicit modelling or quantification of extremal dependence, as well as using information on extremes from all seasons. Some of our recommendations here are supported with simulations. In Sect. 3 we then consider the Bayesian framework for return level inference. Again, the aim is to maximise data usage by properly accounting for dependence and seasonal variation. We also demonstrate the natural extension to prediction here, and present the results of a simulation study suggesting the superiority of the posterior predictive return level over a standard posterior summary.

Table 1 Maximum likelihood estimates of return levels
Fig. 2

Maximum likelihood estimates (points) with associated 95 % profile log-likelihood confidence intervals (lines) for the 10-, 50- and 1000-year return levels for the Newlyn sea-surges. Top row: results from an analysis of block maxima with block length \(\tau \); bottom row: results from a POT analysis with declustering interval \(\kappa \)

2 Increasing the precision of estimated return levels

In this section we review some methods that have been proposed for increasing the precision of estimated return levels by exploiting, rather than removing (as in a POT analysis), any structure in the data owing to temporal dependence. In the case of the wind speed data, we also consider making use of extremes across all seasons—rather than simply the season within which the largest extremes are observed.

2.1 Serial correlation

2.1.1 Markov chain models

The POT approach for excesses over some high threshold u, as demonstrated in Sect. 1.3, has become standard practice in many areas of application. However, some authors (e.g. Smith et al. 1997; Fawcett and Walshaw 2006a) have explored the possibility of explicitly modelling within-cluster behaviour—an interesting exercise in its own right, in terms of the clustering characteristics of environmental series—and an approach which allows the inclusion of all threshold excesses in the analysis. For example, based on the evidence given by plots such as those in the bottom row of Fig. 1, or perhaps inspection of the partial autocorrelation function, we might assume that our series \(X_{1}, X_{2}, \ldots \) forms a stationary first-order Markov chain, whose stochastic properties are completely determined by the joint distribution of consecutive pairs. Given a model \(f(x_{i},x_{i+1}; {\varvec{\psi}})\) with parameter vector \({\varvec{\psi}}\), it follows that the likelihood for \({\varvec{\psi}}\) is:

$$ L({\varvec{\psi}}) = \prod _{i=1}^{n-1} f(x_{i},x_{i+1};{\varvec{\psi}}) \Bigg/ \prod _{i=2}^{n-1} f(x_{i};{\varvec{\psi}}). $$
(4)

To model threshold excesses, the denominator in Eq. (4) is replaced by the corresponding univariate densities based on a limiting model for extremes, any marginal non-stationarity being handled via modelling of the parameters within this model (as discussed in Sect. 1.2). Bivariate extreme value theory is invoked for contributions to the numerator, of which we give a brief summary now for threshold excesses.

Suppose \((x_{1},y_{1}), (x_{2},y_{2}),\ldots ,(x_{n},y_{n})\) are independent realisations of a random vector (X, Y) with joint distribution function F. For suitably high \(u_{x}\) and \(u_{y}\), the marginals for \(X-u_{x}\) and \(Y-u_{y}\) both (approximately) take the form given by (2), with respective parameter sets \((\sigma _{x},\xi _{x})\) and \((\sigma _{y},\xi _{y})\), and with associated rates of threshold excess \(\lambda _{u_{x}}\) and \(\lambda _{u_{y}}\), respectively. Applying

$$\tilde{X} = -\left( \log \left\{ 1-\lambda _{u_{x}}\left[ 1+\xi _{x}\left( \frac{X-u_{x}}{\sigma _{x}}\right) \right] ^{-1/\xi _{x}}\right\} \right) ^{-1}$$

to X (and similarly for Y), the vector \((\tilde{X},\tilde{Y})\) has joint distribution function \(\tilde{F}\) whose margins are approximately standard Fréchet for \(X>u_{x}\) and \(Y>u_{y}\) (see Coles 2001, Chap. 8). It can be shown (Pickands 1981) that the joint distribution function G(x, y) for a bivariate extreme value distribution with standard Fréchet margins has the representation

$$G(x,y)= \exp \big \{-V(x,y)\big \}, \quad x,y>0, \quad {\text {where}} $$
(5)
$$V(x,y)= 2 \int _{0}^{1}\max \big (q/x,(1-q)/y\big )\,dW(q), $$
(6)

and W is a distribution function on [0, 1] satisfying

$$\int _{0}^{1}q\,dW(q)= \frac{1}{2}.$$
(7)

A popular choice of parametric family for G is the logistic family, with \(V(x,y) = (x^{-1/\alpha }+y^{-1/\alpha })^{\alpha }\); here, independence and complete dependence are achieved when \(\alpha =1\) and \(\alpha \rightarrow 0\) respectively. See the appendix of Fawcett and Walshaw (2012) for other choices for G. In a serial context, we would replace x / y with \(x_{i}\) / \(x_{i+1}\) respectively. Contributions to the numerator in (4) can then be obtained by differentiation of (5) with respect to both \(x_{i}\) and \(x_{i+1}\) when both exceed u, with appropriate censoring if one of either \(x_{i}\) or \(x_{i+1}\) lies sub-threshold. If both \(x_{i}\) and \(x_{i+1}\) lie below u then the contribution to the numerator in Eq. (4) is given by the distribution function evaluated at the threshold. The marginal transformation to standard Fréchet and maximisation of the Markov chain likelihood can be performed in a single sweep, making (4) the full likelihood for both marginal and dependence parameters. Return levels can then be estimated on substitution of the estimated marginal parameters into Eq. (3); an estimate of the extremal index can be obtained from the estimated dependence parameter(s) of the bivariate extreme value model used—via simulation [as in Smith (1992) or Fawcett (2005)], or via a polynomial approximation for \(\theta \) (Fawcett and Walshaw 2012).
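
To make this concrete, the sketch below implements, in R, the marginal transformation to standard Fréchet and the numerator contributions to Eq. (4) under the logistic model; the denominator contributions are the univariate analogues and are omitted for brevity. This is our own illustrative sketch, not the authors' code, and all names are placeholders.

```r
## Transformation of the text: x above u maps to -1/log(Fhat(x)); values
## below u are censored at the threshold, so only the tail form matters.
to_frechet <- function(x, u, sigma, xi, lambda_u) {
  Fhat <- 1 - lambda_u * pmax(1 + xi * (x - u) / sigma, 0)^(-1 / xi)
  -1 / log(Fhat)
}

## Censored pairwise negative log-likelihood under the logistic model, on
## the Frechet scale; u_f = -1/log(1 - lambda_u) is the transformed
## threshold. Contributions follow by differentiating G = exp(-V).
logistic_nll <- function(alpha, x, u_f) {
  if (alpha <= 0 || alpha >= 1) return(Inf)
  x1 <- x[-length(x)]; x2 <- x[-1]               # consecutive pairs
  A <- function(a, b) a^(-1 / alpha) + b^(-1 / alpha)
  ll <- numeric(length(x1))
  both  <- x1 > u_f & x2 > u_f
  left  <- x1 > u_f & x2 <= u_f
  right <- x1 <= u_f & x2 > u_f
  none  <- !(both | left | right)
  Ab <- A(x1[both], x2[both])                    # full bivariate density
  ll[both] <- -Ab^alpha + (-1 / alpha - 1) * (log(x1[both]) + log(x2[both])) +
    (alpha - 2) * log(Ab) + log(Ab^alpha + (1 - alpha) / alpha)
  Al <- A(x1[left], u_f)                         # censor second component
  ll[left] <- -Al^alpha + (-1 / alpha - 1) * log(x1[left]) + (alpha - 1) * log(Al)
  Ar <- A(u_f, x2[right])                        # censor first component
  ll[right] <- -Ar^alpha + (-1 / alpha - 1) * log(x2[right]) + (alpha - 1) * log(Ar)
  ll[none] <- -A(u_f, u_f)^alpha                 # G evaluated at the threshold
  -sum(ll)
}

## e.g. alpha_hat <- optimize(logistic_nll, c(0.01, 0.99), x = xf, u_f = uf)$minimum
```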

Parametric modelling of the dependence structure requires an appropriate choice of model, as well as a suitable choice of model order d. Coles and Tawn (1991) demonstrate some diagnostic procedures for assessing the suitability of a first-order dependence structure (\(d=1\)) relative to higher-order dependencies, but interpretation of ‘simplex plots’, for example, can be subjective. For \(d>1\), evaluation of the likelihood also becomes computationally expensive very quickly. Comparison of non-nested dependence models can require ad hoc checks of model goodness-of-fit, the interpretation of which can be subjective (e.g. Smith et al. 1997). More crucial, perhaps, is the assumption of asymptotic dependence when using (5). Of course, standard time series models with sub-asymptotic dependence (e.g. an AR(1) model) could be used instead, but graphical tools to assess the nature of the dependence (e.g. the \(\bar{\chi }\) dependence measure; Coles et al. 1999) can be difficult to interpret.

2.1.2 A non-parametric approach, with simulation study

Over the years there have been many publications on estimating the extremal index—for example, Leadbetter and Rootzén (1988), Smith (1992), Smith and Weissman (1994), Ancona-Navarrete and Tawn (2000), Ferro and Segers (2003), Süveges (2007) and Fawcett and Walshaw (2008, 2012). Most work has focused on exploration of within-cluster behaviour and the clustering characteristics of extremes. However, our aim within the remit of this paper is to use the extremal index to aid, and improve, return level estimation: to increase precision by using information on all extremes, whilst at the same time avoiding altogether the issue of cluster identification necessary in POT analyses. This is explored in Fawcett and Walshaw (2012), and the simulations below reveal some promising results when specific estimators for \(\theta \) are considered.

Appendix 1 summarises some common methods of extremal index estimation. Fawcett and Walshaw (2012) show that quantifying the degree of extremal dependence through the intervals estimator of Ferro and Segers (2003), and incorporating this estimate of \(\theta \) in the estimation of return levels via Eq. (3), can more than double the estimation precision of return levels relative to estimates obtained from a standard POT analysis. However, Fawcett and Walshaw (2012) do not assess the suitability of the other estimators given in Appendix 1. We now present some results of a simulation study to assess the performance of various extremal index estimators and their ability to aid return level estimation; this extends the study in Fawcett and Walshaw (2012) to include all of the estimators summarised in Appendix 1. We also allow for processes other than those which assume asymptotic dependence. These estimators assume stationarity, and so any seasonal variability, for example, needs to be dealt with prior to estimation.
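
As a concrete illustration, the intervals estimator of Ferro and Segers (2003) takes only a few lines of R; the sketch below is our own rendering of their estimator, which switches between two moment-based forms according to the largest observed inter-exceedance time and is capped at 1.

```r
## Intervals estimator of the extremal index from inter-exceedance times
intervals_estimator <- function(x, u) {
  Tgap <- diff(which(x > u))           # inter-exceedance times
  N <- length(Tgap) + 1                # number of exceedances
  theta <- if (max(Tgap) <= 2) {
    2 * sum(Tgap)^2 / ((N - 1) * sum(Tgap^2))
  } else {
    2 * sum(Tgap - 1)^2 / ((N - 1) * sum((Tgap - 1) * (Tgap - 2)))
  }
  min(1, theta)
}
```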

We simulate 1000 chains, each of length 10,000, from several processes and with a range of serial correlations. Specifically, we simulate first-order extreme value Markov chains, as discussed in Sect. 2.1.1, using the (symmetric) logistic / negative logistic models, as well as the (asymmetric) bilogistic model; we simulate max-autoregressive processes, defined by

$$ X_{i} = \max \big \{(1-\theta )X_{i-1},\theta Z_{i}\big \}, \qquad i=1, 2, \ldots , $$

where \(X_{0}\) and \(Z_{i}\) are standard Fréchet distributed (see Sect. 2.1.1); we also simulate Gaussian AR(1) processes, defined by

$$X_{i} = \psi X_{i-1}+\varepsilon _{i}, \qquad i=1, 2, \ldots , $$

where \(\varepsilon _{1}, \varepsilon _{2}, \ldots \) are IID Normal \(N(0,1-\psi ^{2})\) random variables, with \(X_{0}\) being standard Normal. Smith (1992) discusses how the extremal index for first-order extreme value Markov chains can be obtained via simulation; however, Fawcett and Walshaw (2012) exploit the deterministic relationship between \(\theta \) and the parameter(s) in the bivariate extreme value model used to obtain simple polynomial forms. The AR(1) process exhibits serial dependence but limiting extremal independence, and so here \(\theta =1\).
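
The second and third of these processes are straightforward to generate directly; the R sketch below shows both (the extreme value Markov chains instead require simulation from the bivariate models of Sect. 2.1.1). The seed, length and dependence parameters are placeholders.

```r
set.seed(1)
n <- 10000

## Max-autoregressive process: standard Frechet margins, extremal index theta
theta <- 0.5
Z <- -1 / log(runif(n))                    # standard Frechet innovations
X_max <- numeric(n)
X_max[1] <- -1 / log(runif(1))             # X_0 standard Frechet
for (i in 2:n) X_max[i] <- max((1 - theta) * X_max[i - 1], theta * Z[i])

## Gaussian AR(1): standard Normal margins, serial dependence but theta = 1
psi <- 0.8
X_ar <- numeric(n)
X_ar[1] <- rnorm(1)
for (i in 2:n) X_ar[i] <- psi * X_ar[i - 1] + rnorm(1, 0, sqrt(1 - psi^2))
```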

After marginal transformation of our chains to \({\text {GPD}}(\sigma =1,\xi )\), maximum likelihood is used to fit the GPD to excesses above a threshold u, set at the 95 % marginal quantile. Due to the threshold stability property of the GPD (see Coles (2001), Chap. 4), these excesses will be generalised Pareto with scale \(\sigma ^{*}=\xi u+1\) and shape \(\xi \), and so at each replication \(j=1, \ldots , 1000\) we obtain \((\hat{\lambda }_{u},\hat{\sigma }^{*},\hat{\xi })^{(j)}\), \(\hat{\lambda }_{u}\) being the observed rate of threshold excess. Using the methods in Appendix 1, at each replication j we also estimate the extremal index, giving \(\hat{\theta }^{(j)}\); with the estimated marginal parameters, an estimate of the r-year return level \(\hat{z}_{r}^{(j)}\) can then be obtained via Eq. (3) (we use \(n_{y}=365.25 \times (24/3) = 2922\), in keeping with the Newlyn sea-surge data). At each replication, the GPD is also fitted to the set of cluster peak excesses, extracted using runs declustering with various values of \(\kappa \)—in doing so, we can compare the standard POT approach, wherein \(\theta \approx 1\), to the method which makes use of all threshold excesses.
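
A single replication of this pipeline might look as follows in R, reusing return_level, intervals_estimator and the simulated X_max from the sketches above; gpd_nll is our own minimal GPD negative log-likelihood, and the starting values are placeholders.

```r
## Transform the Frechet-scale chain to GPD(sigma = 1, xi) margins
xi_true <- -0.4
U <- exp(-1 / X_max)                          # Frechet d.f. at the data
y <- ((1 - U)^(-xi_true) - 1) / xi_true

## Minimal GPD negative log-likelihood for threshold excesses z
gpd_nll <- function(par, z) {
  sigma <- par[1]; xi <- par[2]
  if (sigma <= 0) return(Inf)
  t <- 1 + xi * z / sigma
  if (any(t <= 0)) return(Inf)
  length(z) * log(sigma) + (1 / xi + 1) * sum(log(t))
}

u   <- quantile(y, 0.95)                      # 95% marginal quantile
z   <- y[y > u] - u
fit <- optim(c(mean(z), 0.1), gpd_nll, z = z)
z50 <- return_level(50, u, fit$par[1], fit$par[2], lambda_u = mean(y > u),
                    theta = intervals_estimator(y, u), n_y = 2922)
```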

Tables 2 and 3 summarise results from the simulation study for extremal index estimators and return level estimates, respectively, for \(\xi =-0.4\) and certain levels of extremal dependence (other values for \(\xi \), and other levels of extremal dependence, were also used—with similar findings obtained). Table 2 shows that for all simulated processes, there is a larger discrepancy between the sampling distribution mean and the true value for \(\theta \) when using the cluster size methods than when using any of the intervals or maxima methods, and that the cluster size methods themselves are highly sensitive to the choice of cluster separation interval \(\kappa \). The cluster size estimators also consistently have a higher root mean squared error (RMSE) than all the other estimators (although not shown here, similar findings were obtained for the blocks estimator of \(\theta \)). Although the maxima methods require the determination of a suitable block size \(\tau \), using \(\tau =\sqrt{n}\) seems to have produced reasonable estimates for the extreme value Markov chain and the max AR process. However, for these two processes the Ferro and Segers (2003) estimator and the K-gaps estimator of Süveges and Davison (2010) are superior when considering their estimated bias and RMSE; for both processes, the mean of the sampling distribution using the K-gaps estimator is closest to the true value for \(\theta \) and the RMSE is smallest—although optimal values for the tuning parameter K have been used, following investigations in Süveges and Davison (2010), and this might be difficult to do in practice. There appears to be much larger bias in estimates of the extremal index for the AR(1) process than for the other two processes studied. However, as discussed in Ancona-Navarrete and Tawn (2000), the cluster size and intervals estimators are actually estimating \(\theta (u)\) rather than \(\theta \), a threshold-based extremal index provided by a ‘penultimate’ expression for \(\theta \). In fact, Ancona-Navarrete and Tawn (2000) find that, for a marginal 95 % threshold (as used here), \(\theta (u) \approx 0.711\) for an AR(1) process (\(\theta (u)\approx \theta \) for the other two processes used here). For a comparison of the performance of these estimators in the Bayesian framework, see Fawcett (2005).

Table 3 shows that return level estimates are less biased when using all threshold excesses, relative to a standard POT approach, regardless of the extremal index estimator used to quantify extremal dependence—and increasingly so as the return period gets larger. For all but the 10-year return level, the RMSE is larger in the standard POT approach. For analyses using all threshold excesses, results are shown for the main contenders in terms of extremal index estimation (from Table 2), and there is little to distinguish between them—although return level estimates obtained using the K-gaps estimator have smaller bias and RMSE for all return periods considered. However, given the need to choose an appropriate block size \(\tau \) for the maxima methods, and the tuning parameter K in the K-gaps method—both of which could be difficult to do in practice—we recommend using the intervals method of Ferro and Segers (2003) which provides a completely automatic solution to extremal index estimation. The results shown in Table 3 are for an extreme value Markov chain, but similar findings were also observed for the other two processes studied, and for different levels of extremal dependence.

2.2 Seasonal variability

The wind speed data observed at Bradfield exhibit clear seasonal variability, with the strongest gusts being observed in the winter months—particularly January (hence the restriction to the month of January in the analysis of Sect. 1.3). Experience suggests that, in the UK at least, taking the calendar month as our seasonal unit satisfactorily reflects the changing nature of the wind climate, whilst resulting in approximate homogeneity within seasons. A modelling approach that identifies as extreme all gusts which are large given the time of year has the potential to increase estimation precision, relative to an approach using only data from a single season. Walshaw (1994) justifies using wind speed extremes from summer months in the UK: he points out that the same mechanism (an alternating sequence of anticyclones and depressions) is responsible for large wind speeds throughout the year—it is just the severity of these systems which gives rise to variations month-by-month. Such an argument supports the use of a seasonal piecewise approach for handling such variation, whereby a different model is fitted to extremes in each month. In the context of threshold models, we could follow the analysis of January wind speeds demonstrated in Sect. 1.3, but repeat the entire estimation procedure for extremes in all other months. Then, assuming independence between months, the monthly-varying GPD parameter estimates can be recombined when obtaining return level estimates by solving the following equation for \(x=z_{r}\):

$$ \prod _{m=1}^{12}H_{m}(x)^{n_{m}\theta _{m}} = 1-r^{-1}, $$
(8)

where \(H_{m}\) is the GPD distribution function in month m with parameter set \((\lambda _{u_{m}},\sigma _{m},\xi _{m})\), and \(\theta _{m}\) / \(n_{m}\) are the extremal index / number of observations in month m.
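
Equation (8) has no closed-form solution for \(z_{r}\), but a one-dimensional root-finder suffices. The R sketch below assumes \(H_{m}\) takes the unconditional upper-tail form \(1-\lambda _{u_{m}}\left[ 1+\xi _{m}(x-u_{m})/\sigma _{m}\right] ^{-1/\xi _{m}}\) for \(x > u_{m}\); the parameter table and search bounds are placeholders.

```r
## Solve Eq. (8) for the r-year return level given monthly GPD fits.
## pars: one row per month, with columns u, sigma, xi, lambda, n_m, theta.
seasonal_return_level <- function(r, pars) {
  H <- function(x, p)                            # monthly d.f. above u_m
    pmin(1 - p$lambda * pmax(1 + p$xi * (x - p$u) / p$sigma, 0)^(-1 / p$xi), 1)
  lhs <- function(x)
    prod(vapply(seq_len(nrow(pars)),
                function(m) H(x, pars[m, ])^(pars$n_m[m] * pars$theta[m]),
                numeric(1)))
  uniroot(function(x) lhs(x) - (1 - 1 / r),      # equate to 1 - 1/r
          lower = max(pars$u), upper = max(pars$u) + 100)$root
}
```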

Table 2 Sampling distribution means, and root mean squared errors (RMSE), for various estimators of the extremal index \(\theta \), and for three different types of process
Table 3 Estimated bias and root mean squared error (RMSE) of return level estimates \(\hat{z}_{r}\) for three return periods

This monthly-varying GPD approach can be adapted to suit seasonal units of any size (depending on the data being analysed), although other methods for handling seasonal variability have been proposed, including the use of Fourier forms to allow the model parameters to vary continuously through time (as demonstrated in Coles 2001). However, most of these methods are computationally burdensome relative to the seasonal piecewise approach and, as Walshaw (1991) illustrates, can add little to return level inference in terms of accuracy and precision. Fawcett and Walshaw (2006b) also investigate the use of a conditional autoregressive structure to allow dependence between wind speed extremes in neighbouring months at Bradfield; again, they find no improvement in return level estimation by doing so. Work in Fawcett (2005) suggests significant differences in the GPD scale and shape for wind speed extremes in different months at Bradfield; often, to reduce over-fitting and where it is deemed appropriate to do so, a constant shape parameter is assumed.

2.3 Other forms of non-stationarity

As discussed so far, both our sea-surge and wind speed extremes are serially dependent, with the wind speed data also exhibiting seasonal variability. Across the time-frames studied, neither seem to display any temporal trend, although in many environmental series this departure from stationarity is an issue. A simple approach here could be to allow a linear / non-linear dependence of the extremal model parameter(s) on a time index. As discussed in Sect. 1.2, a dependence on other covariates can be incorporated in a similar fashion. Generally, pragmatic approaches have been developed to deal with the specific form of non-stationarity observed. For example, Chavez-Demoulin and Davison (2005) use smooth non-stationary general additive models for extremes, in which spline smoothers are incorporated into the GPD; Fawcett (2005) and Eastoe (2009) demonstrate a data pre-processing approach for dealing with seasonality and trend; Atyeo and Walshaw (2012) account for spatial dependence and temporal trend in a region-based hierarchical model for UK rainfall extremes; Jonathan and Ewans (2011) account for dependence between marginal extremes of significant wave height and wave direction / season; Coles and Walshaw (1994) propose a directional model for extreme wind speeds in the UK. For a more comprehensive review, see Jonathan et al. (2014).

2.4 Application to sea-surge and wind speed extremes

We now demonstrate the methods outlined in Sects. 2.1 and 2.2 by application to the Newlyn sea-surges and Bradfield wind speeds. We assume stationarity in the sea-surge data, but deal with seasonal variability in the wind speed extremes observed at Bradfield by adopting the seasonal piecewise approach as discussed in Sect. 2.2.

Considering the Markov chain model approach outlined in Sect. 2.1.1, Fawcett and Walshaw (2006a) provide a detailed investigation of the suitability of a first-order extreme value Markov assumption for the monthly-varying wind extremes. Plots of the \(\chi \) and \(\bar{\chi }\) dependence measures (see, for example, Coles 2001, Chap. 8) suggest asymptotic dependence, providing some justification for using models from bivariate / multivariate extreme value theory for the temporal evolution of the process. A likelihood ratio test reveals that the bilogistic model, allowing for asymmetry in the dependence structure, gives no significant improvement over the simpler (symmetric) logistic model (see Sect. 2.1.1) when assuming first-order dependence only; although further graphical diagnostics suggest a second-order dependence assumption might be more suitable, Fawcett and Walshaw (2006a) reveal that inferences for return levels barely change when the likelihood in (4) is extended to allow for longer-range dependencies. The estimated value of the logistic dependence parameter in each month m, \(\alpha _{m}\), can then be used to find the corresponding estimate of the extremal index \(\theta _{m}\) via the cubic approximation derived in Fawcett and Walshaw (2012):

$$\theta \approx 0.013 - 0.092\alpha + 1.833\alpha ^{2} - 0.756\alpha ^{3}. $$
(9)

Then, with the monthly-varying marginal GPD parameter estimates, these monthly-varying estimates of the extremal index can be used to estimate return levels on solution of Eq. (8) for \(x=z_{r}\). Estimates of the 10-, 50- and 1000-year return levels, with associated standard errors, are shown in Table 4. Standard errors for \(\hat{\theta }_{m}\) (not shown here) have been obtained via the delta method, as have the standard errors for the estimated return levels; we have assumed that all covariances between dependence and marginal parameters are zero. Exactly the same procedure has been used to fit an appropriate Markov chain model to the Newlyn sea-surge extremes, but without the added complexity of seasonally-varying marginal and dependence parameters. Although a first-order dependence structure once again seemed adequate, the bilogistic model showed significant improvement over the logistic model for the sea-surge extremes; the polynomial approximation of the extremal index, derived in Fawcett and Walshaw (2012) as a function of the dependence parameters in the bilogistic model, was used to estimate the extremal index. Once again return level estimates, with standard errors in parentheses, are shown in Table 4.

Also shown in Table 4 are estimated return levels from analyses in which no parametric form for the dependence structure has been assumed; Ferro and Segers’ intervals estimator, and the IWLS estimator of Süveges, have been used to estimate the extremal index, these being our recommendations from the simulation study of Sect. 2.1.2 (note that both assume stationarity, which has been achieved in the wind speed analysis via the seasonal piecewise approach). A block bootstrap procedure has been used to obtain the standard errors for these estimates [see Fawcett and Walshaw (2012) for full details]. For information, and for comparison with the methods making use of all threshold excesses, we have also reported return level estimates obtained under a standard POT analysis. For the sea-surge data, these are exactly the estimates given earlier in Table 1; for the Bradfield data, the POT estimates shown in Table 4 are those obtained from a seasonal piecewise approach for dealing with monthly variations in extreme wind speeds. Here, we have made use of reclustered excess plots (Walshaw 1994) to simultaneously identify monthly-varying thresholds and cluster separation intervals \((u_{m},\kappa _{m}), m=1, \ldots , 12\).

The advantage of making use of all threshold excesses is obvious when we compare the standard errors of the estimated return levels, these being considerably smaller than those obtained from the POT analyses. In fact, we would advise the use of the non-parametric approach in practice, as this does not require the exploratory analyses of the dependence structure that the Markov chain models require. Although the standard errors shown in Table 4 are useful for highlighting the gain in precision when using all threshold excesses, as discussed throughout Sect. 2 we would probably rather not use these standard errors to construct symmetric confidence intervals. Instead, we recommend using a block bootstrap procedure, as outlined fully in Fawcett and Walshaw (2012), Sect. 4.3. Doing so, we construct B bootstrap replications of our process, yielding a collection of estimates \(\{z_{r}^{(1)}, \ldots , z_{r}^{(B)}\}\), from which we can obtain bias-corrected, accelerated (\({\text {BC}}_{\text {a}}\)) confidence intervals as proposed in Efron (1987). Fawcett and Walshaw (2012) show that such intervals give estimated coverage probabilities closer to the intended coverages than do the simpler percentile intervals. Implementing such a bootstrap scheme for the Bradfield wind speeds and Newlyn sea-surges gives confidence intervals for return levels that are appreciably narrower than those shown in Table 1; for example, the 95 % profile-likelihood confidence interval for the 50-year sea-surge at Newlyn, obtained via the standard POT approach with \(\kappa =30\) h, is (0.80, 2.09) m (see Table 1); the corresponding 95 % \({\text {BC}}_{\text {a}}\) interval, using all threshold excesses and Ferro and Segers’ intervals estimator for \(\theta \), is (0.71, 1.02) m. Similar gains are observed when using Süveges’ IWLS estimator for \(\theta \) with all threshold excesses. For more details, see Fawcett and Walshaw (2012).
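
A minimal sketch of the block bootstrap step is given below, reusing gpd_nll, return_level and intervals_estimator from the earlier sketches; for brevity it reports simple percentile intervals, whereas the \({\text {BC}}_{\text {a}}\) intervals used above additionally require the bias-correction and acceleration constants of Efron (1987). The block length and number of replications are placeholders.

```r
## Fixed-length block bootstrap for the r-year return level
block_boot_rl <- function(x, u, r, block_len = 100, B = 500) {
  n <- length(x)
  starts <- seq_len(n - block_len + 1)
  zr <- replicate(B, {
    ## resample whole blocks to preserve short-term dependence
    s   <- sample(starts, ceiling(n / block_len), replace = TRUE)
    idx <- unlist(lapply(s, function(i) i:(i + block_len - 1)))[1:n]
    xb  <- x[idx]
    z   <- xb[xb > u] - u
    fit <- optim(c(mean(z), 0.1), gpd_nll, z = z)
    return_level(r, u, fit$par[1], fit$par[2], lambda_u = mean(xb > u),
                 theta = intervals_estimator(xb, u))
  })
  quantile(zr, c(0.025, 0.975))                  # percentile interval
}
```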

Table 4 Estimates of the 10-, 50- and 1000-year return levels for the wind speeds at Bradfield and the sea-surges at Newlyn

3 Bayesian inference for extremes

The primary aim of this paper is to find an optimal approach for estimating return levels. To this end, we have considered methods for increasing the accuracy and precision of our estimates. Working within the Bayesian framework lends further potential here. As we will demonstrate in Sect. 3.2, complex model structures can easily be estimated via Markov chain Monte Carlo (MCMC); specifically, we allow the sharing of information between sites and across seasons to increase the precision of our return level estimates. The natural extension to the posterior predictive distribution might also be useful for practitioners, the predictive return level giving a single point estimate incorporating uncertainty in parameter estimation and randomness in future observations. Although not fully realised in this paper, there is also the potential to increase estimation precision still further through the inclusion of expert-informed prior distributions.

The Bayesian paradigm was quite late to be adopted by statisticians working on extreme value theory and methods. For some general background, Coles (2001), Chap. 9 devotes a section to this topic, while Stephenson and Tawn (2004) review the literature in a paper which focuses on accounting for the three extremal types. Coles and Powell (1996) carry out a comprehensive review of the literature up to that date, and analyse wind data from a number of locations in the USA by constructing a prior for the GEV parameters based on estimates obtained at other locations. Among the other significant contributions, Coles and Tawn (1996) use expert knowledge to construct a multivariate prior for the GEV parameters, and Smith and Walshaw (2003) extend this idea to bivariate distributions for extreme rainfall at pairs of locations within a region. Smith (1999) considers predictive inference under the Bayesian and frequentist paradigms, and Smith and Goodman (2000) and Bottolo et al. (2003) construct Bayesian hierarchical models for extreme values in insurance problems. Fawcett and Walshaw (2006b) model extreme wind speeds in a region of the UK using a Bayesian hierarchical model. Fawcett and Walshaw (2006a) consider Bayesian inference for Markov chain models (also for extreme wind speeds) using a simulation framework similar to that used by Smith et al. (1997) to obtain estimates of the extremal index. More recently, Sang and Gelfand (2009), Sang and Gelfand (2010) and Davison et al. (2012) demonstrate the use of Bayesian hierarchical models for environmental data which allow for spatial structure in the extremes.

In the absence of any prior specification for the parameters in an extremal model (e.g. the GEV or the GPD; see Sect. 1.2), it is possible to perform an analysis within the Bayesian framework through the use of objective priors (sometimes, quite misleadingly, referred to as ‘uninformative’, ‘non-informative’ or ‘default’ priors). This might also be a preferred approach if the complexity of the model makes inferences difficult or more cumbersome within a standard frequentist setting. Indeed, we discuss this in the context of the GPD (log) scale and shape parameters, and the logistic dependence parameter, in Sect. 3.1.1, where simple, independent, diffuse priors are suggested. However, a more thoughtful development of objective priors for extreme value models is given in Beirlant et al. (2004), wherein maximal data information (MDI) priors and Jeffreys’ priors for the GPD are considered; similarly, Castellanos and Cabras (2007) investigate the use of a Jeffreys prior for the GPD. Ho (2010) and Cabras (2013) develop probability matching priors for the GPD, and Northrop and Attalides (2014) investigate posterior propriety for Jeffreys’, MDI and uniform priors for the GEV and GPD.

3.1 Example: wind speed extremes at Bradfield

3.1.1 Prior specification

In keeping with the spirit of this paper, we aim to make use of information on all threshold exceedances to maximise the precision of our return level estimates. Consider the likelihood in Eq. (4), with parameter vector \({\varvec{\psi}} = (\eta _{m},\xi _{m},\alpha _{m})\) for wind speed excesses over \(u_{m}\) in month m, \(m=1, \ldots , 12\), where

$$\eta _{m} = \log (\sigma _{m}-\xi _{m}u_{m}) $$

and \(\xi _{m}\) are the GPD (log) scale and shape, respectively, and \(\alpha _{m}\) is the logistic dependence parameter for the first-order evolution of the process. As outlined in Sect. 2.2, the nature of the wind climate in the UK justifies the seasonal piecewise approach used. In the Bayesian context, the re-parametrisation of the GPD scale to \((\sigma _{m}-\xi _{m}u_{m})\) gives a parameter which is threshold-independent, allowing the specification of an objective prior for the scale at all threshold levels; working with the natural logarithm of this re-parametrised scale retains the positivity of this parameter in the MCMC sampling scheme. In the absence of any expert prior information, then, we could specify the following independent, diffuse priors for the elements of \({\varvec{\psi}}\):

$$\eta _{m}\sim N(0,10^{4}), \quad \xi _{m}\sim N(0,10^{2}), \quad \alpha _{m}\sim U(0,1),$$
(10)

\(m=1, 2, \ldots , 12\). We might expect such distributions to reflect our prior uncertainty about the marginal / dependence parameters and, in accord with the findings of Coles and Tawn (2005), we find that inferences barely change under order of magnitude changes to the variance specifications in (10). However, an investigation into the dependence structure of wind speed extremes at a location close to Bradfield (see Fawcett 2005) suggests a logistic dependence parameter of around \(\alpha _{m}\approx 1/3\) for all m. Thus, we consider independent Beta(10, 19) priors for \(\alpha _{m}\), the variability of which we believe adequately reflects our knowledge about the dependence of consecutive wind speed extremes at Bradfield, including any uncertainty about differences in the dependence structure of extremes between the two locations. Similarly, from information gathered at this nearby location, we can specify the following bivariate Normal prior distributions for \((\eta _{m}, \xi _{m})\) at Bradfield:

$$(\eta _{m},\xi _{m}) \sim N_{2}\left( {\varvec{\mu}}_{m}, {\varvec{\Sigma}}_{m}\right) , \quad m=1, \ldots , 12.$$

The components of \({\varvec{\mu}}_{m}\) are chosen to closely match our beliefs about what are the most likely values of \((\eta _{m},\xi _{m})\) based on our study of monthly wind speeds at the nearby location. We specify values for \({\mathsf{cov}} (\eta _{m},\xi _{m})\) according to our beliefs regarding the covariances between these parameters at the nearby location, scaled (albeit rather crudely) to reflect our uncertainty about differences between monthly wind speed extremes at the two locations.

3.1.2 Bayesian sampling

After setting initial values for the elements in \({\varvec{\psi}}\) (we use the prior means), a simple Metropolis step is used to generate successive draws from the posterior distribution, giving \((\eta _{m}^{[j]}, \xi _{m}^{[j]}, \alpha _{m}^{[j]})\) at each iteration j, \(j=1, \ldots , 50,000,\) in the sampler. Specifically, within each Metropolis step, a random walk procedure is used to generate candidate values for each of the parameters, the variances of the innovations being tuned to maximise the efficiency of the algorithm (achieving an overall acceptance probability of around 23 %; see Roberts et al. 1997, for a discussion of desirable acceptance probabilities). Such MCMC sampling schemes can be easily implemented using the evdbayes package in R (Stephenson and Ribatet 2014), including tuning of the acceptance probabilities and convergence diagnostics.
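
The Metropolis step itself takes only a few lines of R. The sketch below updates \((\eta _{m}, \xi _{m})\) for a single month under the diffuse priors in (10); for brevity the likelihood is the marginal GPD one, whereas the full scheme also includes the pairwise logistic terms of Sect. 2.1.1 and an update for \(\alpha _{m}\). All names, starting values and innovation standard deviations are placeholders.

```r
## Log-posterior for one month: GPD likelihood for excesses z over u,
## with eta = log(sigma - xi * u) and the priors in (10)
log_post <- function(eta, xi, z, u) {
  sigma <- exp(eta) + xi * u                     # recover the GPD scale
  if (sigma <= 0) return(-Inf)
  t <- 1 + xi * z / sigma
  if (any(t <= 0)) return(-Inf)
  ll <- -length(z) * log(sigma) - (1 / xi + 1) * sum(log(t))
  ll + dnorm(eta, 0, 100, log = TRUE) + dnorm(xi, 0, 10, log = TRUE)
}

## Random-walk Metropolis; tune sd_eta, sd_xi towards ~23% acceptance
mcmc_gpd <- function(z, u, n_iter = 50000, sd_eta = 0.1, sd_xi = 0.05) {
  out <- matrix(NA, n_iter, 2, dimnames = list(NULL, c("eta", "xi")))
  cur <- c(log(mean(z)), 0.1)                    # crude but valid start
  lp  <- log_post(cur[1], cur[2], z, u)
  for (j in 1:n_iter) {
    prop    <- cur + rnorm(2, 0, c(sd_eta, sd_xi))
    lp_prop <- log_post(prop[1], prop[2], z, u)
    if (log(runif(1)) < lp_prop - lp) { cur <- prop; lp <- lp_prop }
    out[j, ] <- cur
  }
  out
}
```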

The bilogistic model, with dependence parameters \((\alpha _{m},\beta _{m})\), or indeed any of the standard models for extremal dependence, can be used in place of the logistic model. In the frequentist analysis of Sect. 2.4, a likelihood ratio test revealed that the bilogistic model, allowing for asymmetry in the dependence structure, gives no significant improvement over the simpler logistic model; in the Bayesian analysis, regardless of our choice of suitable (but independent) priors for \(\alpha _{m}\) and \(\beta _{m}\), the 95 % credible intervals for \((\alpha _{m}-\beta _{m})\), \(m=1, \ldots , 12\), covered zero—suggesting agreement with the frequentist analysis (the bilogistic model reduces to the symmetric logistic model when \(\alpha _m=\beta _m\)). Other posterior predictive checks, such as those demonstrated in Fawcett and Walshaw (2006a), can be used to assess model suitability.

At each iteration j in the MCMC algorithm, the current posterior draw for the logistic dependence parameter \(\alpha _{m}^{[j]}\) is used to obtain a posterior draw for the extremal index via the cubic approximation in Eq. (9), giving \(\theta _{m}^{[j]}\). Then, a corresponding draw from the posterior for various return levels \(z_{r}^{[j]}\) can be obtained on solution of Eq. (8) for \(x=z_{r}^{[j]}\), after substitution of \(\sigma _{m}\), \(\xi _{m}\) and \(\theta _{m}\) with \((e^{\eta _{m}^{[j]}}+\xi _{m}^{[j]}u_{m})\), \(\xi _{m}^{[j]}\) and \(\theta _{m}^{[j]}\), respectively; \(\lambda _{u_{m}}\) is fixed at the observed proportion of exceedances of \(u_{m}\) in each month m. The MCMC sample paths (not shown here) showed rapid convergence to their apparent stationary distributions, with good mixing properties (more formal convergence monitoring diagnostics are available—see, for example, Brooks and Gelman 1998). After removal of the burn-in period (the first 2000 MCMC draws), we are left with \(S=48,000\) posterior draws on which to make inferences. Table 5 (“Standard analysis”) shows posterior summaries for the 10-, 50- and 1000-year return levels for wind speeds at Bradfield. Relative to using the uninformative priors in (10) (results not shown), we observe smaller posterior standard deviations; notice also that these posterior standard deviations are smaller than the estimated standard errors obtained in the frequentist analysis of Sect. 2.4 (see Table 4). Credible intervals in the Bayesian context (see Table 5) are also more readily available, obtained by direct reference to the posterior draws for \(z_{r}\).

Table 5 Posterior means (standard deviations) and 95 % credible intervals in parentheses, for the 10-, 50- and 1000-year return levels from Bayesian analyses of the Bradfield wind speed extremes
Fig. 3

Predictive return level curve (bold line) for Bradfield. Also shown, for comparison, are posterior means for some standard return levels with their 95 % credibility bands

3.1.3 Predictive inference

Suppose we assume the same marginal and dependence structure for future extremes Y of our monthly wind speed processes at Bradfield. Allowing for uncertainty in parameter estimation and future observations, we can write

$${\text {Pr}}\left\{ Y\le y| {\varvec{x}}\right\}= \int _{\varvec{\Psi }}{\text {Pr}}\big \{Y \le y|{\varvec{\psi}}\big \}\pi ({\varvec{\psi}}|{\varvec{x}})d{\varvec{\psi}}$$
(11)

for the predictive distribution of our wind speed extremes, where \({\varvec{x}}\) represents past observations. Solving

$${\text {Pr}}\left\{ Y \le z_{r,{\mathsf{pred}}}|{\varvec{x}}\right\}= 1 - r^{-1} $$
(12)

for \(z_{r, {\mathsf{pred}}}\) therefore gives an estimate of the r-year return level that incorporates uncertainty due to model estimation. Although (11) is analytically intractable, it can be approximated since we have estimated the posterior distribution using MCMC. Regarding our sample \({\varvec{\psi}}^{[1]}, \ldots , {\varvec{\psi}}^{[S]}\) as realisations from the stationary distribution \(\pi ({\varvec{\psi} }|\varvec{x})\), we have

$${\text {Pr}}\left\{ Y \le z_{r, {\mathsf{pred}}}|{\varvec{x}}\right\}\approx \frac{1}{S}\sum _{j=1}^{S}{\text {Pr}}\left\{ Y \le z_{r, {\mathsf{pred}}}|{\varvec{\psi} }^{[j]} \right\} , $$
(13)

which we can set equal to \(1-r^{-1}\) and solve for \(z_{r,{\mathsf{pred}}}\) using a numerical solver. These values are shown in Table 5 for \(r=10\), 50 and 1000. Figure 3 compares predictive and estimative return levels across a range of values of r, showing that, for very long-range return periods, even designing a structure to the upper end-point of the Bayesian 95 % credible interval might result in under-protection, relative to estimates obtained in the predictive analysis.
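
Computationally, solving Eq. (13) amounts to averaging the conditional distribution function over the posterior draws and root-finding. A single-season R sketch follows, in which \({\text {Pr}}\{Y \le y|{\varvec{\psi}}\} = H(y)^{n_{y}\theta }\); the monthly version would replace this with the product in Eq. (8). The draws object and the upper search bound are placeholders.

```r
## Predictive r-year return level from S posterior draws.
## draws: data frame with posterior columns sigma, xi, theta, lambda.
predictive_rl <- function(r, draws, u, n_y = 2922) {
  pr_le <- function(y) {               # average of Pr(Y <= y | psi^[j])
    tail <- pmax(1 + draws$xi * (y - u) / draws$sigma, 0)^(-1 / draws$xi)
    mean(pmin(1 - draws$lambda * tail, 1)^(n_y * draws$theta))
  }
  uniroot(function(y) pr_le(y) - (1 - 1 / r),    # Eqs. (12)-(13)
          lower = u, upper = u + 100)$root
}
```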

3.1.4 Non-parametric approaches for serial dependence

In the earlier frequentist analyses, we advocated the use of non-parametric estimators (e.g. Ferro and Segers’ intervals estimator) for the extremal index rather than a Markov chain model as used in this section. In the absence of a likelihood for the extremal index, such non-parametric methods are difficult to implement within a Bayesian sampling scheme. Ferro and Segers (2003) do propose a maximum likelihood estimator for the extremal index based on their inter-arrival times methodology. However, the model, based on a mixture distribution with one exponential component with rate \(\theta \), assigns all of the inter-exceedance times to the exponential component as \(n \rightarrow \infty \) (where n is the length of the process). Fawcett (2005) illustrates the consequence of using the associated likelihood as an ingredient in Bayesian inference for \(\theta \): the posterior distribution for \(\theta \) converges to a point mass at 1, regardless of the strength of serial correlation present.

Süveges (2007) also suggests a likelihood for \(\theta \) (the corresponding maximum likelihood estimator is demonstrated in the simulation study of Sect. 2.1.2 of this paper); however, Table 2 reveals substantial bias when the underlying process is an extreme value Markov chain. The K-gaps estimator of Süveges and Davison (2010) is likelihood-based, and as we show in the simulation study of Sect. 2.1.2 it performs well when K is chosen optimally. Indeed, since the first-order extreme value Markov chain assumption, with logistic dependence, seems reasonable for our monthly wind speeds data, this could have been tried here; however, more generally it might be difficult to choose a value for K which lends optimal performance to this estimator of the extremal index. Fawcett and Walshaw (2008) demonstrate the use of a GEV likelihood which incorporates \(\theta \), proposed by Ancona-Navarrete and Tawn (2000), as an ingredient for Bayesian inference for \(\theta \), although this approach is sensitive to the block size \(\tau \) that must be chosen. The semi-parametric estimator of Northrop (2012) is also based on a likelihood and so is an additional possibility in this context, although once again a tuning parameter (again the block size \(\tau \)) must be chosen carefully. Thus, for Bayesian inference, we recommend using a suitable parametric form for the dependence structure in the extremes, as demonstrated in Sects. 3.1.1, 3.1.2 and 3.1.3.

3.2 Spatial considerations

In Sect. 3.1 we demonstrated the advantages of a Bayesian approach to return level inference through a basic application to the wind speed data at Bradfield. Even a rather crude attempt to incorporate prior knowledge into the analysis resulted in estimates of posterior variability that were substantially smaller than the asymptotic standard errors in the corresponding frequentist analysis. Prediction is also handled neatly within the Bayesian framework, as illustrated in Sect. 3.1.3—estimates of predictive return levels are potentially appealing to practitioners, as they account for uncertainty due to model estimation and uncertainty in future observations. Another advantage of working within the Bayesian framework is the relative ease with which we can build more complex, and potentially realistic, model structures, as we now demonstrate. In the following application, return level estimation precision is increased still further.

Fawcett and Walshaw (2006b) develop a hierarchical model for extreme wind speeds observed at 12 locations in central / eastern England (Bradfield, as used throughout this paper, being one of these sites). In an attempt to share information across sites and seasons, they specify the following model structure for GPD scale and shape parameters, and the logistic dependence parameter, as used throughout Sect. 3.1:

$$\begin{aligned} \eta _{m,s}&= \gamma _{\eta }^{(m)}+\varepsilon _{\eta }^{(s)},\\ \xi _{m,s}&= \gamma _{\xi }^{(m)}+\varepsilon _{\xi }^{(s)} \quad {\text {and}} \\ \alpha _{s}&= \varepsilon _{\alpha }^{(s)}, \end{aligned}$$

where, generically, \(\gamma \) and \(\varepsilon \) represent seasonal and site effects respectively, \(m=1, \ldots , 12\) being an indicator of season (month), and \(s=1, \ldots , 12\) being an indicator of site. All random effects for \(\eta _{m,s}\) and \(\xi _{m,s}\) were taken to be normally and independently distributed; the means and variances of the random effects distributions were given distributions that were thought to reasonably reflect prior ignorance about the seasonal and site effects, whilst retaining conjugacy wherever possible to simplify the MCMC sampling scheme. The logistic dependence parameter \(\alpha \) was allowed to vary by site only (an a priori assumption justified by the nature of the wind climate across seasons within the UK; the analysis in Sect. 3.1 also revealed similarity in \(\alpha _{m}\) across all months m), but a U(0, 1) prior was used for \(\varepsilon _{\alpha }^{(s)}\) to reflect prior ignorance about the dependence structure for wind speed extremes for each site as a whole (of course, more informative priors, as specified in Sect. 3.1.1 for the Bradfield wind speeds, could have been used). Where conjugacy facilitated specification of full conditional distributions, Gibbs sampling was used (i.e. to obtain draws from the posterior distributions of the parameters in the random effects distributions); a Metropolis step, as discussed in Sect. 3.1.2, was used elsewhere. See Appendix 2 for more details, including the full conditional distributions used in the Gibbs sampler.

Posterior summaries of return levels, at Bradfield, are shown in Table 5 (“Hierarchical model”). The effect of sharing information on extremes at other sites can be seen in the reduction of posterior variability relative to the standard Bayesian analysis (which uses information at Bradfield only, although information from a neighbouring site is used to aid prior specification).

Although Fawcett and Walshaw (2006b) demonstrate the ease with which more complex hierarchical models can be fitted within the Bayesian framework, they do not account for any spatial structure; that is, in the model hierarchy outlined above, sites are exchangeable, an over-simplification which can be addressed by adopting a parametric form to govern the spatial dependence between extremes observed at multiple sites within a region. To this end, Davison et al. (2012) consider using Gaussian processes (after suitable marginal transformations), with standard correlation functions from the geostatistics literature [e.g. Diggle and Ribeiro 2007] to represent the decay in dependence between extremes at a pair of sites with distance. Within the Bayesian context, they also consider latent variable models for rainfall extremes observed at a network of sites across a region in Switzerland, using the co-ordinates of these sites as covariates to allow interpolation of extremes at locations for which no rainfall measurements were made. On a completely continuous scale, this allows the production of ‘heat maps’, wherein estimated return levels can be displayed smoothly for all points within a region simultaneously. Davison et al. (2012) also consider max-stable models for spatial dependence, making use of the multivariate extension of Eq. (5) and the various models for extremal dependence discussed. Currently, spatial models are a hot topic of research in the field of extremes, the implementation of which might be accessible to practitioners through the development of R packages such as CompRandFld (Padoan and Bevilacqua 2013).

3.3 Predictive inference: simulation study

Throughout this section we have demonstrated the natural extension of Bayesian inference to prediction. In particular, we have discussed the potential appeal of the predictive return level to practitioners; inference on this quantity provides a design parameter estimate with uncertainty in parameter estimation and future observations ‘built in’. We now compare the sampling properties of \(z_{r, {\mathsf{pred}}}\) to those of two commonly-used point estimates from the posterior distribution of \(z_{r}\) through a simulation study. Following the Bayesian analyses of wind speed extremes at Bradfield detailed in this section, we simulate large ‘master’ datasets from the seasonal piecewise model (see Sect. 3.1) and the hierarchical model (see Sect. 3.2). Specifically, we use \((\bar{\sigma }_{m},\bar{\xi }_{m},\bar{\alpha }_{m})\), the posterior means of the GPD parameters and logistic dependence parameter in the seasonal piecewise model, to simulate 10,000 wind speed extremes in each month m, \(m=1, \ldots , 12\); for the hierarchical model, we use \((\bar{\sigma }_{m,s},\bar{\xi }_{m,s},\bar{\alpha }_{s})\) for each month m and site s, \(m,s=1, \ldots , 12\). Simulating 10,000 extremes in each month gives around 30 times as many simulated extremes as we have actual observed extremes at Bradfield. Large MCMC runs are then applied to these master datasets to obtain estimates of predictive return levels at Bradfield, these estimates being treated as the true values of \(z_{r,{\mathsf{pred}}}\). Specifically, Eq. (13) is solved for \(z_{r,{\mathsf{pred}}}\) using, for example, \({\varvec{\psi} }^{[j]}=\left( \sigma _{m}, \xi _{m}, \theta _{m}\right) ^{[j]}\) in the seasonal piecewise model, where \(\theta _{m}\) is obtained from the posterior draw for \(\alpha _{m}\) via Eq. (9) and \(j=1,\ldots , 10^7\) after the removal of burn-in. Similarly, the means of the posterior draws for \(z_{r}\) from these large MCMC runs, obtained by solving Eq. (8) for \(x=z_{r}^{[j]}\) using \({\varvec{\psi }}^{[j]}\), \(j=1, \ldots , 10^7\), are taken to be the true posterior means for \(z_{r}\), which we label as \(z_{r, {\mathsf{mean}}}\). We also obtain \(z_{r, {\mathsf{upper}}}\), the 97.5 % empirical quantile of \(z_{r}^{[j]}, j=1, \ldots , 10^7\) (i.e. the upper endpoint of the 95 % credible interval for \(z_{r}\), often used as a design parameter in practice).

We simulate N years of wind speed extremes from each of the seasonal piecewise and hierarchical models, using \((\bar{\sigma }_{m},\bar{\xi }_{m},\bar{\alpha }_{m})\) and \((\bar{\sigma }_{m,s},\bar{\xi }_{m,s},\bar{\alpha }_{s})\) respectively and with the same number of simulated extremes as were observed at Bradfield (and at the other sites in the hierarchical model). We then find \(P_{r, {\mathsf{pred}}}\), \(P_{r, {\mathsf{mean}}}\) and \(P_{r, {\mathsf{upper}}}\)—the proportions of years in which the maximum simulated extreme exceeds \(z_{r, {\mathsf{pred}}}\), \(z_{r, {\mathsf{mean}}}\) and \(z_{r, {\mathsf{upper}}}\) respectively. This exercise is repeated L times in order to assess the variability in our estimates of these proportions; we use \(N=10,000\) and \(L=1000\). The entire simulation procedure is then repeated for other strengths of extremal dependence and for other dependence models: for example, most \(\bar{\alpha }_{m}\) for the Bradfield wind speed extremes were around 0.3, so we also consider \(\bar{\alpha }_{m}=0.5\) and \(\bar{\alpha }_{m}=0.75\). We also consider the case of asymptotic independence, through AR(1) processes with varying strengths of serial correlation, as well as other marginal shape parameters \(\bar{\xi }_{m}\), to assess the performance of each return level estimate under different tail behaviours.
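Since each \(P_{r,\cdot }\) is simply the proportion of simulated annual maxima exceeding a fixed level, the repetition scheme can be sketched compactly. In the sketch below we draw annual maxima directly from their implied d.f. \(H(z)^{n_{y}\theta }\) by inverse transform, rather than simulating full dependent series from the logistic model; this shortcut, and all numerical values, are illustrative assumptions only.

```python
import numpy as np

def annual_maxima(L, N, u, sigma, xi, theta, n_y, rng):
    """L replicates of N annual maxima, drawn by inverse transform from the
    implied d.f. H(z)^(n_y*theta): M = H^{-1}(V^(1/(n_y*theta))), V ~ U(0,1)."""
    p = rng.uniform(size=(L, N)) ** (1.0 / (n_y * theta))
    return u + (sigma / xi) * ((1.0 - p) ** (-xi) - 1.0)  # GPD quantile, xi != 0

rng = np.random.default_rng(1)
M = annual_maxima(L=1000, N=10_000, u=20.0, sigma=3.0, xi=-0.1,
                  theta=0.4, n_y=30, rng=rng)    # illustrative parameter values
z_design = 36.2   # candidate design level (roughly the 200-year level here)
P = (M > z_design).mean(axis=1)          # one exceedance proportion per replicate
print(P.mean(), np.quantile(P, [0.025, 0.975]))  # sampling mean and 95% interval
```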

Table 6 Sampling distribution summaries for \(P_{r, {\mathsf{mean}}}\), \(P_{r, {\mathsf{upper}}}\) and \(P_{r, {\mathsf{pred}}}\) using \(L=1000\) repeated simulations of \(N=10,000\) years of threshold exceedances from the seasonal piecewise model obtained from fits to the Bradfield wind speed data

Table 6 summarises one arm of the simulation study, showing sampling properties of the different exceedance proportions for the seasonal piecewise model, using \((\bar{\sigma }_{m},\bar{\xi }_{m},\bar{\alpha }_{m})\) from the original fits to the Bradfield wind speed extremes as discussed in Sect. 3.1. Although not shown here, similar findings were obtained for different \(\bar{\alpha }_{m}\) and \(\bar{\xi }_{m}\), and for simulations based on the hierarchical model (although, owing to the sharing of information across sites and seasons, sampling variability was substantially reduced in that case); results using AR(1) processes for the dependence structure were also similar. The table shows results for \(r=10\), 50 and 200 years, although results for other return periods were also examined. We make several observations:

  • \(z_{r, {\mathsf{mean}}}\) consistently leads to significant over-estimates of \(r^{-1}\) (i.e. the sampling distribution means for \(P_{r, {\mathsf{mean}}}\) are higher than the intended exceedance probabilities \(r^{-1}\), and the 95 % confidence interval lower bounds from these distributions always exceed \(r^{-1}\)). This suggests that using the posterior mean could result in substantial under-protection.

  • The predictive return levels \(z_{r, {\mathsf{pred}}}\) consistently lead to significant under-estimates of the intended exceedance probabilities. However, this is to be expected: these quantities take into account variability in the estimates of the marginal and dependence parameters, as well as uncertainty in future observations. Thus, we would expect \(z_{r, {\mathsf{pred}}}>z_{r, {\mathsf{mean}}}\), leading to exceedance probabilities smaller than \(r^{-1}\). In practice, this could lead to over-protection. However, this might be on a par with the common practice of designing to the upper endpoint of the 95 % confidence interval for \(z_{r}\) (see the next point), but with uncertainty in future observations also included.

  • In all cases, there appears to be no significant difference in the exceedance proportions resulting from \(z_{r, {\mathsf{pred}}}\) and \(z_{r, {\mathsf{upper}}}\), although the sampling distribution means are, in most cases, smaller for \(z_{r, {\mathsf{pred}}}\); see previous point.

Our simulations show that none of the return level estimators attains the nominal exceedance probability of \(r^{-1}\). Although this is to be expected of \(z_{r,{\mathsf{upper}}}\) and \(z_{r,{\mathsf{pred}}}\), the fact that \(z_{r,{\mathsf{mean}}}\) consistently over-estimates this exceedance probability suggests that the posterior mean might be inadequate in practical applications. As expected, \(z_{r, {\mathsf{upper}}}\) and \(z_{r,{\mathsf{pred}}}\) give consistently smaller estimates of the exceedance probability; however, as single-number summaries, both at least have uncertainty in parameter estimation built in, with \(z_{r, {\mathsf{pred}}}\) also allowing for randomness in future observations.

4 Conclusions and recommendations

We have presented a summary of the current state of play with regard to methodology for return level estimation, and here we provide some conclusions and recommendations for practitioners. Block maxima methods are clearly wasteful of data and, if return level estimation is the priority, should only be considered a serious option when block maxima are the only data available, or when the model being implemented is already so complex that the extra structure involved in a threshold-based selection of extremes is considered a step too far; Atyeo and Walshaw (2012), for example, take this view.

Generally, threshold methods should be preferred as they make better use of the available data. Even so, a key recommendation is that the traditional POT approach be discarded: by retaining only cluster peaks it remains wasteful of data, and sub-asymptotic theory indicates that the resulting parameter estimates are biased (Eastoe and Tawn 2012), backing up the empirical findings of Fawcett and Walshaw (2007). The recommended alternative is to use all exceedances, together with careful estimation of the extremal index; on the basis of this work we recommend one of the non-parametric intervals estimators proposed by Ferro and Segers (2003) or Süveges (2007), a minimal implementation of which is sketched below. To assess the uncertainty associated with return levels, we recommend producing confidence intervals via a block bootstrap procedure, as described fully in Fawcett and Walshaw (2012). Alternatively, for a more theoretical route than estimation of the extremal index, Eastoe and Tawn (2012) derive the sub-asymptotic behaviour of cluster peaks as a combination of the marginal and dependence behaviour of all exceedances, giving rise to an appropriate model.
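For readers wishing to implement the intervals estimator recommended above: with N exceedances of u occurring at times \(S_{1}< \cdots <S_{N}\) and interexceedance times \(T_{i}=S_{i+1}-S_{i}\), Ferro and Segers (2003) estimate \(\theta \) by \(2\left( \sum T_{i}\right) ^{2}/\{(N-1)\sum T_{i}^{2}\}\) when \(\max _{i}T_{i}\le 2\), and by the bias-adjusted version \(2\{\sum (T_{i}-1)\}^{2}/\{(N-1)\sum (T_{i}-1)(T_{i}-2)\}\) otherwise, capping the result at 1. A minimal sketch (variable names are ours):

```python
import numpy as np

def intervals_estimator(x, u):
    """Intervals estimator of the extremal index (Ferro and Segers 2003)."""
    s = np.flatnonzero(x > u)        # exceedance times S_1 < ... < S_N
    if s.size < 2:
        raise ValueError("need at least two exceedances of u")
    t = np.diff(s)                   # the N - 1 interexceedance times
    if t.max() <= 2:
        theta = 2.0 * t.sum() ** 2 / (t.size * (t ** 2).sum())
    else:                            # bias-adjusted version for larger gaps
        theta = 2.0 * (t - 1).sum() ** 2 / (t.size * ((t - 1) * (t - 2)).sum())
    return min(1.0, theta)
```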

The recommendations thus far are aimed at those wishing to take a frequentist approach to inference. However, the authors would favour a Bayesian approach, and would recommend it to practitioners wherever their philosophical approach to inference, and their willingness to engage with the computational issues, permit. Inference is bound to be improved by the incorporation of useful prior information, and this is almost always available in one form or another. It could come through genuine elicitation of expert beliefs but, more commonly, the Bayesian approach allows the incorporation of information from other studies, or from other locations considered in the same study, providing a very natural route to sharing information and so improving estimation precision. The Bayesian approach is thus a natural way to consider spatial or hierarchical models for extremal behaviour at multiple sites (such models could, of course, be estimated in the frequentist setting; however, we believe MCMC techniques within the Bayesian framework provide a much more convenient route to inference here). Estimation uncertainty is naturally represented in the posterior distributions of all quantities of interest, including return levels, and this information is easily extracted from any sampling scheme used for inference. Finally, there has been a long-standing clash between frequentist statisticians and many practitioners in the interpretation of return levels. In our experience, practitioners often take the view that a return level estimate, in itself a statement about probabilities, should not then need to be accompanied by an estimate of the uncertainty in its value. The Bayesian approach, unlike the frequentist one, is fully supportive of this view: the posterior predictive return level does exactly what such practitioners require, in that all uncertainty about parameter estimation (and randomness in future observations) has been integrated out, so that the resulting prediction is correctly interpreted as a probability statement which need not (and should not) be accompanied by a further assessment of uncertainty. Of course, uncertainty about the model itself is always present, but both frequentist and Bayesian analyses condition on the fitted model being correct when presenting results, while acknowledging that this is inevitably an approximation to the truth.

Although this paper is partly a review, the literature on return level estimation—both in statistics journals and in journals of a more applied nature—is so vast that the review element of this article cannot be exhaustive. Indeed, readers need look no further than the SERRA journal itself to find many articles relating to the problem of return level estimation in various environmental applications: for example, papers by Shiau (2003), Xu et al. (2010), Galiatsatou and Prinos (2011), Vanem (2011) and Van der Vyver (2015) all tackle return level or return period estimation in a variety of contexts, most using methods similar to those presented in Sect. 1. Serinaldi (2015) also provides a very interesting SERRA communiqué on return period estimation, relevant to the work in this paper. Given the compelling case for Bayesian methods presented here and elsewhere, it is, we believe, surprising that such methods have not yet become commonplace in practice.

To sum up, then, we recommend a method which makes use of all threshold exceedances wherever possible, and we believe a Bayesian approach is preferable where feasible. Of course, all models need a sensible (often pragmatic) treatment of seasonal variation built in, and all of the modelling approaches we have described can be extended to incorporate covariate effects, including temporal trends, in the parameter values and hence in the return levels. We believe the methods we propose for handling temporal dependence allow all threshold excesses to be pressed into use in a fairly simple way. Further, working within the Bayesian framework allows the estimation of the predictive return level—a quantity which lends itself to easy communication with practitioners, having, as it does, all sources of uncertainty built in. The methods we outline are robust and versatile, and could easily be applied to most environmental variables.