1 Introduction

The credibility of climate predictions rests on the treatment of uncertainty. For a given forcing, uncertainty arises from unknown model error, expressed as the discrepancy between the predicted model state and the actual future climate state. The two most important sources of error in this context are structural error, caused by the imperfect construction of the parameterisations, and parametric error, caused by non-optimal calibration of model parameter values. Errors in initial conditions can be treated as analogous to parametric errors for our purposes.

An archetypal problem is the stability of the Atlantic meridional overturning circulation (AMOC), often equated with the Atlantic thermohaline circulation (THC), although the AMOC forcing is not entirely thermohaline. Changes in the AMOC would have major consequences for European and global climate (Vellinga and Wood 2002, 2008), but models simulate a wide range of possible future behaviour (Gregory et al. 2005; Stouffer et al. 2006). Most models tend to show a weakening of the present Northern-sinking pattern of AMOC, as measured by the average rate of sinking of water mass in the North Atlantic, in response to anthropogenic carbon emissions. As part of a large-scale comparison of modelling results, Gregory et al. (2005) found a 10–50% weakening of the AMOC in 140-year simulations with CO2 increasing to four times pre-industrial levels. The AMOC is widely believed to be sensitive to freshwater forcing, either by a stronger hydrological cycle in a warmer climate or by ice-sheet melting; thus much effort has gone into so-called “hosing” experiments in which fresh water is added to the ocean in high northern latitudes. Responses to hosing experiments are also widely spread, with Stouffer et al. (2006) finding a reduction between 9 and 62% in response to a not-unreasonable forcing of 0.1 Sv (1 Sv = 10⁶ m³ s⁻¹). Synthesising results from 29 simulations performed by nine separate models for the IPCC’s fourth assessment report (Meehl et al. 2007), weighted by model skill, Schmittner et al. (2005) found a weakening of the AMOC by 25 ± 25% at year 2100.

The prediction of AMOC behaviour thus remains subject to considerable uncertainty; indeed, the thorough elicitation study of Zickfeld et al. (2007) revealed that leading experts believe the range of likely behaviour to be considerably wider than that found in models, partly because of known structural deficiencies. The issue of possible overconfidence in such elicitations is covered in the review by Kynn (2008), who argues that any such bias can be expected to be small in well-designed, real-world studies, particularly in predictive situations and where subjects are experienced in making probabilistic judgements.

It is important to realise that model “intercomparisons” do not amount to a quantification of structural model error, for three reasons. Firstly, most studies only consider a set of “best estimate” simulations, thus deliberately avoiding lower-probability outcomes and ruling out comprehensive sampling of the distribution. Secondly, the models are usually structurally similar, potentially sharing certain types of error. Thirdly, differences between models will, in practice, be a mixture of structural and parametric components.

A convincing quantification of structural error in AMOC predictions would require quantitative statistical connection between different simulators (Goldstein and Rougier 2004) and remains some way off. However, parametric error may well be of at least comparable order of magnitude, as evidenced by the wide range of behaviour in single-model ensembles (Edwards and Marsh 2005; Murphy et al. 2004). Quantification of parametric error requires knowledge of model behaviour throughout a typically high-dimensional parameter space, and thus requires large ensembles of runs. Systematic calibration of models without rigorous quantification of errors can be referred to as tuning. The intermediate-complexity C-GOLDSTEIN model (Edwards and Marsh 2005), part of the GENIE model framework (Lenton et al. 2007), has been used as a test-bed for a range of tuning techniques, firstly by Edwards and Marsh (2005), who used basic Latin hypercube sampling with 1,000 simulations, then by Beltran et al. (2006) using a cutting-plane optimisation method, Hargreaves et al. (2004) with an ensemble Kalman filter, and Price et al. (2006), who used a multiobjective genetic algorithm. We will use the same model in this study, but on a different spatial grid (implying previous tuning exercises may not be quantitatively relevant). The process of Bayesian calibration applied to climate models has been described in abstract terms by Rougier (2007), but the practical application would be extremely challenging, even for relatively simple models. The first step in a full calibration is the expert elicitation of prior probability distributions for all important parameters. The expert elicitation of Zickfeld et al. (2007) involved full-day interviews with 12 experts, for only a handful of well-studied outputs, but complex models can have hundreds of uncertain inputs. Furthermore, expert elicitation of priors would ideally involve additional quantitative analysis, rather than simple questioning. The second step in Bayesian calibration is a quantification of model behaviour across input space, the final step being the incorporation of constraints from observational data. Using the C-GOLDSTEIN model, Challenor et al. (2006) proceeded to the second step in a calibration of AMOC stability and found a surprisingly high probability (around 30–40%) of an AMOC collapse by 2100, possibly influenced by the narrow priors, which were largely based on the posterior distributions found in the tuning exercise of Hargreaves et al. (2004).

Our objective here is to present an alternative to full calibration that greatly simplifies the procedure, by seeking only to identify simulated outputs which can uncontroversially be classified as unphysical. Our example application, which revisits the issue of AMOC stability in C-GOLDSTEIN, serves to illustrate that even with such weak constraints, statistical modelling of ensembles of simulations can still reveal important features of model behaviour.

2 Precalibration

In this section we start by describing—in general terms—the statistical approach to model calibration, taking into account the imperfection of our model. We contrast this with a ‘lightweight’ alternative that we call ‘precalibration’, which makes fewer demands on our judgements. We denote our climate model as g(·). Its inputs \(x \in {\mathcal{X}}\) are those quantities about which we are uncertain: in a climate model these would typically be sub-grid-scale parameterisations and flux-corrections. Uncertain initial conditions could be treated similarly in principle, but we do not consider this possibility further here. We refer to \({\mathcal{X}}\) as the input space, and the set containing g(x) as the output space. The actual value of the climate is denoted y, and the observed climate is denoted z. Here we assume that the selected model outputs correspond to measurable, observable quantities, such that the model error could, in principle, be quantified in terms of the differences z − y and y − g(x).

2.1 Calibration

The inputs to a complex model are often tuned in order to improve the relationship between the model outputs and observations on the underlying system. ‘Calibration’ is used to describe this process when performed within a statistical framework; see, e.g., Goldstein and Rougier (2004, 2006), or Rougier (2007) in the context of ensemble-based climate prediction. The standard approach is to assert the existence of some ‘best input’ x*, and to quantify the model’s structural error in terms of the discrepancy y − g(x*). The observational errors z − y also need to be quantified, unless they are judged to be dominated by structural error (Rougier 2007). The probability calculus can then be used, in conjunction with a prior distribution Pr(x*), to infer a conditional or posterior distribution Pr(x*|z): the probability distribution of the best input conditional on the observational data. If a point estimate is needed, e.g. for further evaluations of the model, the value E(x*|z) is a natural candidate. Goldstein and Rougier (2009) discuss the ‘best input’ approach, and its foundational and practical limitations.

The main challenge with this approach is to quantify the structural error, y − g(x*). This is an uncertain vector, and, assuming for simplicity that the model is judged to be unbiased and the structural error is chosen to be Gaussian, the quantification of structural error is in terms of a discrepancy variance matrix. This variance matrix is an essential part of the calibration process, and it would be a serious mistake to proceed with the calibration of an imperfect model, such as a climate model, without quantifying it. Ignoring it completely is akin to setting the variance to zero—asserting that the model is perfect except only for uncertainty about the model parameters. This is not acceptable for the current generation of climate models.

Climate scientists have only recently confronted the challenge of specifying the structural error variance (Murphy et al. 2007). Direct attempts are very challenging, so it is natural to ask whether alternative approaches can be developed which allow for the existence of structural error, and thus do not amount to assuming a model is perfect, but are nevertheless simple enough to be tractable and relatively uncontroversial in their basic assumptions. This is the objective of ‘precalibration’. It is less powerful than full calibration, in terms of its ability to provide accurately quantified probabilistic predictions, but it is considerably less demanding and also less subjective. Precalibration does not attempt to quantify structural error as such, but rather to make progress in analysing model behaviour while allowing for the existence of uncertainty and error in general terms.

2.2 Precalibration

The basic idea of precalibration is to rule out some choices of x as candidates for x*. In order to do this, we begin by identifying model outcomes that are sufficiently contrary to established system behaviour that they can be relatively uncontroversially classified as ‘non-physical’; for example, a pre-industrial Arctic with no sea ice. If g(x) is judged non-physical, we are prepared to assign a zero or near-zero value to the probability that x is a good candidate for x*; in other words, we deem x to be an ‘implausible’ input value. We use the term ‘unphysical’ to refer to model solutions that disagree strongly with observations rather than to states of the world that could not exist. A collapsed AMOC in a simulation of the modern climate, for instance, will be classed as unphysical, although it could be a physically sensible solution in certain paleoclimate regimes. Equally, the relevant criteria could, for instance, be biological rather than purely physical.

The attractions of precalibration are: (1) it is based on simple and relatively uncontroversial criteria; (2) it does not require us to specify a prior distribution for x*; and (3) it does not make explicit or detailed use of the actual observations z. Its limitation is that it does not permit us to narrow our set of candidate values for x* to the extent that a fully probabilistic calibration using the same evaluations and observations might have done. Nevertheless, in practice there is a balance between the degree to which the ruling-out becomes controversial, and the extent to which the set of candidates for x* is reduced. Also note that precalibration does not rule out a subsequent calibration using z: there is no double-counting because we do not have to consult z explicitly when classifying certain values for g(x) as non-physical. Ultimately, then, precalibration provides a relatively low-cost opportunity to learn about the model inputs, which does not compromise further analysis.

Ideally, the process of precalibration involves the following two steps.

  1. Identify a region in the output space of the model g(·) which is ‘non-physical’;

  2. Map this region back into the input space,

     $$ {\mathcal{N}} \mathrel{\triangleq}\left\{x \in {\mathcal{X}} : g(x)\,\hbox{ is non-physical}\right\}. $$
     (1)

In practice, we cannot compute g(x) for every x. Hence, we define the implausibility of x, which is the probability that g(x) is non-physical:

$$ \hbox{Imp}\!\left({x}\right) \mathrel{\triangleq} \hbox{Pr}\!\left( x \in {\mathcal{N}}\right) = \hbox{Pr}\!\left(g(x)\,\hbox{ is non-physical}\right). $$
(2)

With infinite resources Imp(x) would be either 0 or 1, because we would simply evaluate g(x) and see whether or not it is in \({\mathcal{N}}\). Implausibilities between 0 and 1 arise because in practice we are obliged to predict whether or not \(g(x) \in {\mathcal{N}}\), based on an ensemble of model evaluations. Therefore, the calculation of the relevant probabilities, and hence of implausibility, is based on an ensemble and on a statistical model. As a result Imp(x) will not be totally objective, because judgements are involved about where to evaluate the climate model, and how to build the statistical model. With sufficient evaluations the impact of these judgements will be minor, but where resource constraints limit the number of evaluations there will be a trade-off between the transparency of the method, and the additional information supplied through our judgements. In our analysis below we have favoured transparency, but we are fortunate to have a fairly large ensemble (more than 1,000 model evaluations). Rougier et al. (2009) provides an example of using more detailed judgements about the model.
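As a minimal illustration of how such a statistical model converts an ensemble into implausibilities (the object names here are ours, and a plain logistic regression stands in for whatever model is actually adopted):

```r
# Illustrative sketch in R: 'ens' is a hypothetical data frame holding the
# (scaled) input values of each ensemble member and a 0/1 flag 'nonphys'
# recording whether its output was classified as non-physical.
fit <- glm(nonphys ~ ., family = binomial, data = ens)

# Imp(x): predicted probability that g(x) is non-physical at untried inputs
imp <- function(x_new) predict(fit, newdata = x_new, type = "response")
```

The statistical models used below are built in the same spirit, but with stepwise term selection (Sect. 4.2).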

2.3 Projection

Implausibility assigns a score to any point \(x \in {\mathcal{X}}\). However, if \({\mathcal{X}}\) is not low-dimensional, it is not easy to convey implausibility information. What we would really like to be able to analyse and discuss is the effect of small subsets of the inputs; for example, in our Genie-I climate model below we would like to be able to identify whether a combination of low values of Atlantic–Pacific moisture flux, APM, and high values of atmospheric moisture diffusivity, AMD (Table 1), is likely to be non-physical, and discuss why this might be.

Table 1 Inputs for the Genie-I model

Suppose we are interested in the subset \(x_1, \ldots, x_m\) of the inputs \(x_1, \ldots, x_n\), spanning a subspace \({\mathcal{X}}_A \subset {\mathcal{X}}\), where \(x_A = (x_1, \ldots, x_m)\) and \(x = (x_1, \ldots, x_n) = (x_A, x_B)\). We define the projection of implausibility onto the subspace \({\mathcal{X}}_A\) by asserting that a given point \(x_A \in {\mathcal{X}}_A\) is implausible if, for every value of \(x_B\), we expect \((x_A, x_B)\) to be implausible. This notion, first suggested in this context by Craig et al. (1997), can be expressed

$$ \hbox{Imp}\!\left(x_A\right) \mathrel{\triangleq} \min\limits_{x_B} \hbox{Imp}\!\left(x_A, x_B\right). $$
(3)

If \(x_A\) is implausible, i.e., \(\hbox{Imp}(x_A)\) is close to one, then \(\hbox{Imp}(x_A, x_B)\) must be close to one for all \(x_B\); i.e., all values of x compatible with \(x_A\) are likely to be implausible.

To illustrate, imagine that \(x = (x_1, x_2)\) and that Imp(x) is generally low, but has a ridge of high values running along \(x_1 = x_2\). In this case, according to (3), both \(\hbox{Imp}(x_1)\) and \(\hbox{Imp}(x_2)\) are low, as we never see the ridge in the one-dimensional projections. But because of the possibility of the ridge, it would be wrong to say that all values of \(x_1\) were not-implausible. Therefore an implausible region in a subset of the inputs is strong information, but the absence of such a region does not rule out the possibility of an implausible region in a superset of our subset. In practice, we would hope to find implausible regions in small subsets of the inputs, as these can be visualised graphically.
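A toy numerical version of this ridge example (purely illustrative, with an arbitrary functional form) confirms that the projection in (3) never sees the ridge:

```r
# Imp is near one on the ridge x1 = x2 and near zero away from it
imp <- function(x1, x2) exp(-25 * (x1 - x2)^2)

grid <- seq(0, 1, length.out = 101)

# Projection (Eq. 3): minimise over x2 for each value of x1
imp_x1 <- sapply(grid, function(x1) min(imp(x1, grid)))

max(imp_x1)  # small everywhere: the ridge is invisible in the 1-D projection
```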

3 Our climate model

Our climate model, which we denote Genie-I, comprises a reduced-physics (frictional geostrophic) 3D ocean model coupled to a 2D energy moisture balance model (EMBM) of the atmosphere and a dynamic-thermodynamic sea-ice model. The ocean model includes realistic bathymetry, an isoneutral and eddy-induced mixing scheme and spatially varying drag. The version used in this study is configured on a 64 × 32 grid, with eight logarithmically spaced depth levels in the ocean. In the work here, we use a seasonal version of the model (seasonally varying insolation). See Edwards and Marsh (2005) for a full description of the model. This version of GENIE (also referred to as C-GOLDSTEIN) is orders of magnitude less computationally expensive than most other 3D ocean-climate models, but still retains the nonlinear dynamics of the AMOC, and has thus proven a useful model for demonstration of climate model calibration techniques (Hargreaves et al. 2004; Beltran et al. 2006; Price et al. 2006). However, calibration will depend strongly on the resolution. The previous studies had a lower resolution in longitude, and a constant area for all gridcells, implying a different latitudinal distribution of gridpoints, with higher equatorial and lower polar resolution. Nevertheless, the choice of input parameter ranges is based on these earlier studies, in particular Edwards and Marsh (2005). In keeping with the philosophy of precalibration, the upper and lower bounds are intended to exclude only uncontroversially extreme values.

The Genie-I inputs are given in Table 1. Many of the inputs are common to other models, but some require explanation: the drag parameterisation replaces all nonlinear and diffusive momentum effects with a simple linear friction term for which the inverse coefficient, ODC, has the dimensions of time; this frictional formulation leads to excessive dissipation of momentum, which is countered by a scaling of the wind stress by a factor WSF, to give realistic wind-driven flow; the single-layer atmosphere lacks dynamical eddies, thus atmospheric transport is performed by diffusion according to a fixed latitudinal profile with amplitude AHD and width WAH for heat, and a constant amplitude AMD for moisture. There is also advection by fixed wind fields, scaled by coefficients ZHA and ZMA, but heat is only advected zonally. Modelled atmospheric moisture transport from Atlantic to Pacific is relatively weak, but is critical for maintaining the AMOC, so we add a constant Atlantic-to-Pacific moisture transfer, scaled by the parameter APM. Above a threshold, THP, excess moisture is rained out of the atmosphere instantaneously (in other versions of GENIE, a small time lag is applied). Formally, the time-derivative of velocity is neglected, but at each timestep the calculated velocity is relaxed back to the value at the previous timestep at a rate controlled by the parameter LRL.

4 Sequential design

Our intention is to evaluate the parameter space of Genie-I, in order to identify, if possible, low-dimensional regions that are implausible. These regions will help us to understand Genie-I better, and make our subsequent use of the model more efficient, for example by avoiding model evaluations at implausible input values.

In a pilot study we discovered that the Genie-I solver failed to complete the spin-up at some input values. Such numerical failures could have two possible causes: either the discrete numerical solver has failed to approximate the correct, physically reasonable solution to the continuous model equations, or the solution to the continuous model equations for the given inputs is itself unphysical, featuring extreme values which cause the solver to fail. The distinction between these possibilities may be important for subsequent improvements to the model and solver, but at this stage of the analysis we are concerned with locating implausible input values for a given configuration of the model and solver; thus we treat the failure of the solver to spin up at x as prima facie evidence that g(x) would be non-physical. In other words, \({\mathcal{C}}^c \subseteq {\mathcal{N}}\), where \({\mathcal{C}}\) is that part of the input space where the solver completes, and \({\mathcal{C}}^c\) is its complement. Further examination (see below) revealed that most of the failures were ultimately physical in origin, although in general applications it may not always be practical to determine whether the origin of failure is numerical or physical.

We divided our budget of approximately 2,000 evaluations into two parts. In the first part we used a space-filling design over the whole of the input space. We used the result of this ensemble to construct a statistical model for \(\hbox{Pr}\!\left(x \in {\mathcal{C}} \right)\). We find that 341 of the 1,000 evaluations in this ensemble completed. For the second ensemble we used this statistical model to select evaluations that had a high probability of completion. Of this second ensemble (of 1,087 evaluations), 799 completed. It is important to appreciate that although 2,087 evaluations may seem like a lot, they are very sparsely distributed through a 16-dimensional space, which has 2¹⁶ = 65,536 corners. Despite our ensemble, we remain uncertain about whether \(x \in {\mathcal{C}}\), for an arbitrary \(x \in {\mathcal{X}}\).

We now describe our approach in more detail.

4.1 First design

Design for computer experiments is a well-developed area; see, e.g., the review paper of Koehler and Owen (1996), or the textbook of Santner et al. (2003). The standard approach for an initial design is to use a space-filling design such as a maximin Latin Hypercube. This gives reasonable coverage of the input space, providing good information about the main effect of each input, and some information about the low-order interactions.

A maximin Latin Hypercube treats all of the inputs equally. We make one modification, to prioritise the inputs which we judge to be important, termed the ‘active’ inputs (Craig et al. 1997, 2001). We identify OHD, AHD, AMD, WAH, ZHA, and ZMA as likely to be active inputs for our evaluations of Genie-I. These were chosen as they control important transports of heat/moisture in the ocean and atmosphere (ZHA, ZMA: atmospheric advection; AHD, AMD: atmospheric heat and moisture diffusion; OHD: ocean heat diffusion). We would like our design to be sensitive to interactions among these inputs in particular. Therefore, having generated a 1,000 × 16 maximin Latin Hypercube, we examine all \(\binom{16}{6} = 8,008\) sets of six columns, to find the set with the best properties for identifying interactions. We quantify this using the determinant of the 6 × 6 correlation matrix. We assign our six active inputs to the six columns with the largest determinant; crudely, if there were a linear combination among the columns the determinant would be zero, and this is the kind of design we would like to avoid. This type of assignment of inputs to columns is a simple way to prioritise some of the inputs on the basis of weak judgements about which inputs will be active. Bayesian experimental design (see, e.g., Chaloner and Verdinelli 1995) allows for more detailed judgements, where they exist. Note that while the choice of active inputs may be more controversial than other choices in the precalibration process, it can be verified a posteriori (see Sect. 6.1), and is designed purely to aid the statistical modelling process. The conclusions should not be significantly affected.
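A minimal sketch of this design construction in R, assuming the lhs package (the code actually used in the study is not shown here):

```r
library(lhs)

set.seed(1)
design <- maximinLHS(1000, 16)   # space-filling design on [0, 1]^16

# Try all choose(16, 6) = 8,008 subsets of six columns and keep the subset
# whose 6 x 6 correlation matrix has the largest determinant; the six
# 'active' inputs are then assigned to these columns.
subsets <- combn(16, 6)
dets    <- apply(subsets, 2, function(cols) det(cor(design[, cols])))
active_cols <- subsets[, which.max(dets)]
```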

4.2 Modelling the probability of completion

We evaluate Genie-I at the 1,000 values for x, of which 341 complete. We would like to map the relationship between x and completion, in order to avoid performing evaluations with a high chance of failing to complete in the second part of our experiment. For simplicity and transparency, we use standard statistical tools for this task, namely logistic regression with stepwise variable selection, implemented in the Statistical Computing Environment R (R Development Core Team 2004), using the stepAIC function (in the MASS library, see Venables and Ripley 2002). There are some technical concerns about applying logistic regression to the output of a deterministic model such as Genie-I (discussed in Rougier et al. 2009), but we do not consider these to be critical for what is effectively an exploratory analysis.

First, we transform the inputs OHD, OVD, ODC, AHD, AMD, and SID by taking logarithms. Then we map all inputs onto the range [−1, 1] using the minimum and maximum values in Table 1. This range makes odd and even functions orthogonal with respect to a uniform weighting function, improving the selection of terms in the stepwise selection. We initialise our statistical model with a constant and linear terms only. Then we grow the statistical model using stepwise selection on all quadratics, cubics, and two- and three-way interactions (see, e.g., Draper and Smith 1998, ch. 15). Our chosen statistical model minimises the Akaike Information Criterion (AIC).
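A minimal sketch of this fitting step, assuming a data frame of already-transformed inputs; the scope formula is only indicative of how the candidate terms might be offered to stepAIC:

```r
library(MASS)

# 'ens' is assumed to hold the 16 transformed inputs (log-transformed where
# noted, then rescaled to [-1, 1]) and a 0/1 outcome 'completed'.
base <- glm(completed ~ ., family = binomial, data = ens)

# Stepwise selection; this upper scope allows all interactions up to third
# order (quadratic and cubic terms would be offered as I(x^2), I(x^3)
# terms in the same way).
fit <- stepAIC(base, scope = list(lower = ~ 1, upper = ~ .^3),
               direction = "both", trace = FALSE)
```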

Fifty-five terms are added using this approach, and the linear term in LRL is deleted (indicating that LRL has little influence on completion), so that there are no terms in LRL in the resulting statistical model; the SOC input is also marginal. The first interactions selected (i.e. the most important) are WAH:AHD, ZMA:OHD, ODC:OHD, AMD:AHD, AMD:OHD, ZHA:AHD, ZMA:AHD, and ZHA:WAH. In Fig. 1 we present a simple visual summary of the way in which the inputs interact with each other. We construct an undirected graph where the vertices are the inputs, and edges indicate interactions. We do not show all the interactions, since that would be hard to read; instead we show the top interactions according to the order in which they are selected. In the absence of a thorough sensitivity analysis of the form of the graph to the details of the statistical model fitting process, the graph must be interpreted with great caution. Nevertheless, where parameters are multiply connected, this suggests that they are relatively important in the determination of completion, and where parameters are linked, there may be nonlinear interactions which are also important. Conversely, parameters which are isolated or do not appear at all may have relatively little influence.

Fig. 1
figure 1

Graph of the main relationships between the inputs for determining the probability of completion. An edge between two inputs indicates a two-way interaction. Three edges to a star indicate a three-way interaction and all three two-way interactions

The completion graph can be interpreted in terms of the analysis of failure modes. In an analysis of 100 randomly selected failed simulations, 98 failed apparently as a result of extremely low temperatures, below −150°C. Of these, 12 had high values of AHD and WAH, apparently leading to numerical failure via diffusional instability in the atmosphere. All but 18 of the remaining failures appeared to result from insufficient atmospheric heat transport to the poles, with low values of some or all of the parameters WAH, AHD, AMD and ZHA. In the graph, a high-diffusion failure mode involving WAH and AHD is visible around the upper-right star; this region of the graph also contains a low-diffusion failure mode involving WAH, AHD, AMD (which implies latent heat transport through moisture transport) and ZHA, the latter two having no direct connection, perhaps because zonal heat advection can only act on poleward heat advection indirectly, via zonal redistribution of heat, e.g. between land and ocean regions. Such nonlinear effects connecting atmosphere and ocean (via ZMA and OHD) and involving heat and moisture fields appear in the lower left of the graph. Apart from this link, ocean parameters are surprisingly isolated at the top and bottom of the graph, suggestive of a relatively weak influence on completion. This may be related to a better initial constraint on ocean parameters, or the better conservation of properties in the ocean part of the coupled system (where heat is conserved in the interior), or a less heavily parameterised model than the simple EMBM atmosphere, or simply a better solver, and hence a lesser role in failures. There is no obvious evidence for high fluid-velocity Courant–Friedrichs–Lewy (CFL) failures (e.g. near-limiting velocities prior to numerical failure), and this failure mode was not identified as important, again probably reflecting conservative input parameter ranges.

As a form of statistical model criticism, we can use the resulting statistical model to compute a point prediction for \(\hbox{Pr}\!\left(x \in {\mathcal{C}}\right)\) at any \(x \in {\mathcal{X}}\). As a simple guide to the quality of our statistical model, the following table shows the predicted and actual outcomes for the ensemble, based on our statistical model and a threshold of 50%:

$$ \begin{array}{cccc} & x \in {\cal C}^c & x \in {\cal C} & \hbox{Sum }\\ \hbox{Pr}\!\left( {{x \in {\cal C}}} \right) < 0.5 & 617 & 40 & 657 \\ \hbox{Pr}\!\left( {{x \in {\cal C}}} \right) \geq 0.5 & 42 & 301 & 343 \\ \hbox{Sum} & 659 & 341 & 1,000 \\ \end{array} $$
(4)

This shows a misclassification error for acceptance, defined as the probability that a point above our threshold fails to complete, of 42/343 ≈ 12%, and a misclassification error for rejection, defined as the probability that a below-threshold point would have completed, of 40/657 ≈ 6%. These are much better than could be achieved from a more limited knowledge of the ensemble. The case of no predictive knowledge other than the mean, for example, analogous to tossing a biased coin with probabilities 341/1,000 and 659/1,000, would give misclassification errors of 66 and 34% for acceptance and rejection respectively.
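For reference, the entries of (4) and the two error rates can be recovered from the fitted model along the following lines (continuing the hypothetical names of the sketch above):

```r
pred <- predict(fit, type = "response") >= 0.5     # accept if Pr >= 0.5
tab  <- table(accepted = pred, completed = ens$completed)

tab["TRUE", "0"]  / sum(tab["TRUE", ])   # acceptance error, approx. 42/343
tab["FALSE", "1"] / sum(tab["FALSE", ])  # rejection error,  approx. 40/657
```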

4.3 Second design

We use our statistical model for \(\hbox{Pr}\!\left(x \in {\mathcal{C}} \right)\) to assess each candidate for our second design. We set a threshold ν and keep the candidate x if \(\hbox{Pr}\!\left(x \in {\mathcal{C}}\right) \geq \nu\). There are two errors we can make with this approach. First, we can screen out an x which would have completed. Second, we can fail to screen out an x which does not complete. As ν decreases from one to zero we trade the probability of the first error (which is one when ν = 1) against the probability of the second (which is one when ν = 0). Where we set ν will depend on the cost of the two types of error. We regard the first error as the more critical, and we aim to choose a value for ν that makes the first error roughly half as probable as the second. As shown in the table in (4), the choice of ν = 0.5 satisfies this criterion, based on the results of the first ensemble. About 34% of the evaluations get past the threshold, so if we generate an initial design of 1,000/0.343 ≈ 2,915 over the whole of \({\mathcal{X}}\), then after screening we should end up with about 1,000 evaluations in our second design, favouring \({\mathcal{C}}\).

We follow the same steps as before, generating a 2,915 × 16 maximin Latin Hypercube, and assigning the active inputs to the best subset of six columns. Then we predict \(\hbox{Pr}\!\left( x \in {\mathcal{C}}\right)\) for each candidate value for x in turn, and keep those for which it is at least 0.5. The result is 1,087 evaluations in the second ensemble. After evaluating them, we find that 799 complete, or 74%.
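A sketch of this screening step, continuing the earlier sketches; 'input_names' and 'to_model_scale' are hypothetical stand-ins for the vector of 16 input names and for the transformation of Sect. 4.2:

```r
cand <- as.data.frame(maximinLHS(2915, 16))
names(cand) <- input_names            # assumed vector of the 16 input names

# keep only candidates whose predicted completion probability is >= 0.5;
# 'to_model_scale' is a hypothetical helper mapping the unit-cube design onto
# the raw ranges of Table 1 and then onto the [-1, 1] model scale
p_complete <- predict(fit, newdata = to_model_scale(cand), type = "response")
design2 <- cand[p_complete >= 0.5, ]  # roughly 1,000 candidates survive
```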

4.4 Transient runs

At this stage of the experiment, we have 2,087 evaluations, of which 1,140 complete their spin-up. We now run each spun-up evaluation forward using a 1%/annum compound increase in CO2 from 1850 to 2100: in the case of Genie-I this is represented as a direct increase in radiative forcing. At this stage we lose another 94 evaluations (39 from the first design and 55 from the second), for which the solver failed to handle the transient behaviour; again, we classify these as non-completers. This leaves us with 1,046 completed evaluations after both the spin-up and transient phases.
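The forcing scenario itself is a simple compound ramp; the following lines merely check the implied concentration multiplier (in Genie-I the ramp is applied as a direct increase in radiative forcing, which we do not attempt to reproduce here):

```r
years    <- 1850:2100
co2_mult <- 1.01^(years - 1850)   # 1%/annum compound increase
tail(co2_mult, 1)                 # about 12 times the 1850 level by 2100
```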

5 Implausibility analysis

5.1 Non-physical ranges

The precalibration outputs and ranges for the Genie-I model are given in Table 2. Note the deliberately wide ‘physical’ ranges. We determined these limits by considering what we would class as non-physical for our Genie-I model. Although we treated the five target outputs individually, it turns out that there is a dominant non-physical mode, which is the absence of a positive AMOC cell. In this case the maximum Atlantic streamfunction will be too low; the temperature in the upper Atlantic will be too low; and the Atlantic will be too fresh relative to the Pacific, as the interbasin salinity contrast is known to be closely associated with the northern-sinking positive AMOC state, presumably because denser, high-salinity water is prone to sink in the North Atlantic. Table 2 also shows the percentage of evaluations in our ensemble that are too low or too high in at least one of the years 1850, 1900, 1950 and 2000. In total, 23% of our 2,087 evaluations satisfy all five ranges, which is to say that 77% are classified as non-physical.
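Operationally, the classification is a mechanical check against these ranges; a minimal sketch (all object names here are ours, and the actual bounds are those of Table 2):

```r
# 'ranges' is assumed to be a named list of c(lower, upper) bounds, one per
# target output, taken from Table 2; 'out' holds the outputs of a single run
# indexed by year.
check_years <- c(1850, 1900, 1950, 2000)

non_physical <- function(out, ranges) {
  any(sapply(names(ranges), function(v) {
    vals <- out[out$year %in% check_years, v]
    any(vals < ranges[[v]][1] | vals > ranges[[v]][2])
  }))
}
```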

Table 2 Precalibration ranges for the climate values corresponding to the Genie-I outputs

5.2 Statistical modelling

We now focus on a second set of probabilities, namely Imp(x), as defined in Sect. 2.2. Rather than construct a single statistical model, we choose to construct two, and combine them using the rules of probability:

$$ \begin{aligned} \hbox{Imp}\!\left(x\right) & = \hbox{Pr}\!\left(x \in {\mathcal{N}} \right) \\ & = 1 - \hbox{Pr}\!\left(x \in {\mathcal{N}}^{c}\right)\\ & = 1 - \hbox{Pr}\!\left(x \in {\mathcal{N}}^{c}, x \in {\mathcal{C}}\right) \\ & = 1 - \hbox{Pr}\!\left(x \in {\mathcal{N}}^{c} \,|\, x \in {\mathcal{C}} \right) \,\hbox{Pr}\!\left(x \in {\mathcal{C}}\right)\\ \end{aligned} $$
(5)

where the introduction of \(x \in {\mathcal{C}}\) in the third line follows from \(x \in {\mathcal{N}}^c \Rightarrow x \in {\mathcal{C}}\), and ‘|’ denotes ‘conditional upon’. The last line follows from the definition of conditional probability. This decomposition allows us to construct the full implausibility from our model of completion and from the ensemble of completed runs.

The statistical model for \(\hbox{Pr}\!\left(x \in {\mathcal{C}}\right)\) is similar to the statistical model we have already constructed from the first part of our design (see Sect. 4.2). We refit the statistical model, with the same choice of regressors, but now using the full ensemble of 2,087 evaluations. The incomplete evaluations in the spin-up of the second ensemble are likely to be particularly informative, because they contradict the prediction of the model fitted on the first ensemble alone. The misclassification rate rises to 15%, but a rise is to be expected because we do not re-select the regressors in the model, as a precaution against over-fitting. If the mechanism that triggers a solver failure in the transient phase were different from that in the spin-up, it would tend to cause a rise in the misclassification rate, but we have no evidence that this has occurred in our case.

The statistical model for \(\hbox{Pr}\!\left(x \in {\mathcal{N}}^c | x \in {\mathcal{C}}\right)\) is fitted only on the 1,046 evaluations which complete, in the same way as described in Sect. 4.2. The misclassification rate of the statistical model is 0.5%. Figure 2 shows the graph of the main relationships between the inputs, after building our statistical model. Perhaps not surprisingly, this graph is easier to interpret than the graph for simulation failures. There is a broad separation between ocean parameters on the right and atmosphere parameters on the left, with parameters in the centre of the graph being of the greatest significance for ocean–atmosphere interaction and exhibiting the largest number of interactions in the graph: six for AMD, and five each for APM and OHD. Ignoring CRF, the parameters in the lower-left region (THP, ZMA, AMD and APM) all control moisture flux, whereas the four parameters in the upper right control heat flux. The graph reveals which parameters are most important in ocean–atmosphere interactions controlling the AMOC (the principal unphysical mode) and confirms the importance, but relative isolation, of ocean drag (ODC), and of the precipitation threshold (THP) and moisture advection (ZMA) in the atmosphere, parameters which it can be tempting to ignore in trying to understand the model.

Fig. 2
figure 2

Graph of the main interactions between the inputs for determining the probability of a not-unphysical output, among evaluations that complete. See the caption of Fig. 1 for details

We compute the implausibility using two statistical models combined, rather than just one (which we could have constructed using the 481 not non-physical outcomes from 2,087 evaluations), because this allows us to attribute high implausibility consistently between the two possible causes: a failure to complete at x or, if complete, a non-physical outcome. An additional advantage is that the statistical model for \(\hbox{Pr}\!\left(x \in {\mathcal{N}}^c | x \in {\mathcal{C}} \right)\) is more focused than the model for \(\hbox{Pr}\!\left(x \in {\mathcal{N}}^c\right)\), being constrained to a smaller region of the input space, and being descriptive of a simpler outcome. This makes it easier to fit the statistical model (cf. the low misclassification rate), and—we hope—easier to interpret the result.
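In code, the combination in Eq. (5) is a one-liner once the two classifiers are fitted (the names below are ours, for illustration):

```r
# 'fit_complete' models Pr(x in C) on all 2,087 runs; 'fit_physical' models
# Pr(x in N^c | x in C) on the 1,046 completed runs.
implausibility <- function(x_new) {
  p_c   <- predict(fit_complete, newdata = x_new, type = "response")
  p_phy <- predict(fit_physical, newdata = x_new, type = "response")
  1 - p_phy * p_c              # Eq. (5)
}
```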

6 Further analysis using implausibility

At this stage we have derived a statistical model for Imp(x) for all values of the input vector x in our input space. This function is many orders of magnitude cheaper to evaluate than the original numerical model, but its form, as a multidimensional function of its inputs, potentially contains valuable information about the behaviour of the underlying model. To illustrate how the implausibility function may be interrogated to obtain such information, we now consider three linked examples. First we order the parameters by importance, then we project the implausibility onto the four most important parameters, then we turn to the existence of the cliff-edge AMOC catastrophe.

6.1 What are the important inputs?

A simple scalar measure can be used to summarise the importance of each input in determining implausibility. Here, an input is deemed important if it can cause a large change in implausibility. Note that this differs from the more usual interpretation, in which an ‘important’ input is one which can cause a large change in g(x), as identified in a sensitivity analysis. Therefore, for each input in turn we take a sequence of values from small to large, and for each value we compute the implausibility over a space-filling design in the other inputs. We then take the mean absolute value for the changes in these implausibilities as the value increases, and summarise these in a single mean value for each input.
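A sketch of this importance summary as we read it, with arbitrary design sizes; 'implausibility' is the combined statistical model of Sect. 5.2, 'input_names' the vector of 16 input names, and all inputs are taken to be on the [−1, 1] model scale:

```r
library(lhs)

importance <- function(input, n_steps = 11, n_fill = 200) {
  steps <- seq(-1, 1, length.out = n_steps)
  fill  <- as.data.frame(maximinLHS(n_fill, 15) * 2 - 1)   # other 15 inputs
  names(fill) <- setdiff(input_names, input)

  # implausibility at every (step value, fill point) combination
  imp_mat <- sapply(steps, function(s) {
    grid <- fill
    grid[[input]] <- s
    implausibility(grid)
  })

  mean(abs(apply(imp_mat, 1, diff)))  # mean absolute change along the sequence
}

# sort(sapply(input_names, importance), decreasing = TRUE)   # cf. Fig. 3
```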

The result is shown in Fig. 3. The two inputs AHD and AMD are the most important, followed by WAH and OHD. The first five inputs were among the six specified as ‘active’ inputs in our experimental design, providing an a posteriori verification of their importance. Indeed, the ordering suggests that the quantitative importance of inputs in controlling implausibility is primarily determined by their effect on meridional heat and moisture transport. Note that the effect of each input is measured relative to its assumed input range, in other words, to our uncertainty about its best input value. In the heavily parameterised, largely diffusive EMBM atmosphere of Genie-I, the weakly bounded diffusivity amplitudes AHD and AMD, which strongly control heat and moisture transport, thus appear as the dominant parameters. The next six inputs also play significant roles in global heat or moisture transport: the ocean drag coefficient ODC by exerting a frictional drag on the large-scale ocean transport, and the ocean vertical diffusivity OVD via its effect on the AMOC. The remaining eight parameters generally have only indirect effects on global-scale transports, with the exception of ZHA, which was amongst our six ‘active’ inputs but, unlike ZMA, does not affect meridional transport, and furthermore is constrained to a small maximum value relative to the diffusive transports, possibly explaining its relatively minor role in implausibility.

Fig. 3
figure 3

Scalar summary of the importance of each input in determining implausibility, ordered from most to least important (see text for details). The value indicates the degree to which a change in the value of the input changes implausibility over the input space as a whole

6.2 Implausibility of the four most important inputs

We now project implausibility onto the four most important inputs identified in Sect. 6.1. Figure 4 shows a four-way layout. The lightest areas have implausibility of <5%, while the darkest areas have implausibility of >95%. The difference between the left- and right-hand panels shows that low values of WAH are more implausible than high values, and the lack of difference between the top and bottom panels shows that changing OHD does not alter this. Within the right-hand panels, very large values of AHD are implausible for all values of AMD. As AHD and WAH both affect the form of the atmospheric thermal diffusivity as a function of latitude, implausibility at high AHD and WAH could be related to high-diffusivity numerical breakdown. Nevertheless, even though AHD and WAH are very closely related, their effects on implausibility are not trivially related. The lowest values of all four transports, in the bottom left of the upper left plot, also show high implausibility, possibly related to the unphysical polar conditions identified previously for low diffusion. The interaction between heat and moisture diffusivities AHD and AMD is not simple: starting from the saddle point in the lower left plot, a reduction in AMD increases implausibility but can be offset by either an increase or a decrease in AHD or, to a lesser extent, a decrease in ocean heat diffusivity OHD. It could be relevant that increased moisture transport implies increased latent heat transport but, on the other hand, meridional moisture and heat transport have opposing direct density effects on driving the thermohaline circulation. Alternatively, the nonlinear features of the plot may be related to competition between the five different physicality targets. We do not attempt to rationalise the form of the implausibility surface in any more detail, since our objective was simply to illustrate the potential for mapping out its behaviour in multiple dimensions. In general, the surface will have some complicated dependence on all 16 inputs. In the next section, we focus on a more tractable projection onto only two dimensions.

Fig. 4
figure 4

Implausibility projected onto the four most important inputs, as judged from Fig. 3; the darker areas are more implausible, and the contour lines are at 5, 25, 50, 75, and 95%. Each panel shows AHD and AMD, while the four panels comprise a two-way layout of OHD (top low, bottom high) and WAH (left low, right high). Note that both AHD and AMD are modelled on a logarithmic scale. Units are given in Table 1.

6.3 The AMOC ‘cliff-edge’ catastrophe

We now consider the question of the existence of a ‘cliff-edge’ AMOC catastrophe in freshwater forcing input space, as identified by Marsh et al. (2004), by considering projections of implausibility onto relevant subspaces of the inputs. Figure 5 compares a cross-section through Imp(x) with the relevant projection \(\hbox{Imp}(x_A)\) from (3). In the left-hand panel of Fig. 5, APM and AMD have been varied on a grid, with the other 14 model inputs held fixed at their standard values. This picture tells us about Genie-I’s response on one 2-dimensional plane through the 16-dimensional model input space. The ‘cliff-edge’ indicates that on this plane there is a sharp division between settings of APM and AMD for which the model’s response is non-physical, and those for which it is not. This panel is somewhat comparable to the top-left panel of Fig. 5 of Marsh et al. (2004). In that analysis the only indicator of non-physicality was the absence of a strong positive AMOC. The boundary ran bottom-left to top-right. Our figure shows that low values of AMD also cause non-physical outcomes. In comparing these two results it is natural to hypothesise that the additional non-physical region in our analysis is due to the additional indicators of non-physicality that we include. It is also worth noting that this low AMD region corresponds to the strongest AMOC region in the plot, thus it presumably corresponds to the high-AMOC failures. This could occur where low atmospheric heat transport is being compensated by the ocean. It may also be significant that the lowest value for AMD used by Marsh et al. was 5 × 10⁴ m² s⁻¹, while the different resolution and lack of seasonality could also have a bearing on the failure modes.

Fig. 5
figure 5

The probability that the model output is non-physical (Imp(x)) shown for combinations of the Atlantic–Pacific moisture flux, APM (Sv), and the atmospheric moisture diffusivity, AMD (×10⁶ m² s⁻¹). a All other model inputs set to their standard values (see Table 1). b Implausibility, projected through the other model inputs, using Eq. 3. Darker shading indicates a larger probability. Crosses indicate the points plotted in Fig. 6

The left-hand panel of Fig. 5 tells us nothing about the model input space as a whole. The right-hand panel, on the other hand, does exactly this, as it shows the projected implausibility for APM and AMD, which involves projecting through the other 14 model inputs. Minimising over the other 14 model inputs cannot result in an implausibility that is larger than that when the other 14 are at the standard values; hence no point in the right-hand panel can be darker than in the left-hand panel. The result of projection is that much of the implausible region disappears: for low moisture diffusivity AMD and high Atlantic–Pacific moisture flux APM, compensating adjustments in other parameters can give rise to physical model output. On the other hand, the low-APM, high-AMD region, corresponding to the AMOC cliff-edge, shifts towards more extreme values, but otherwise remains intact. In this region, even wide-ranging adjustments in the 14 other parameters apparently cannot produce physical output.

To illustrate the connection between the cliff-edge and the AMOC, Fig. 6 shows the AMOC in three simulations corresponding to the crosses marked in Fig. 5 along a transect across the cliff-edge. At each point, the values of the remaining 14 inputs are chosen to minimise the implausibility Imp(x). As expected, in the uppermost panel, corresponding to the implausible region in the projected APM - AMD space, the AMOC is in a fully collapsed state. The middle panel represents an intermediate point on the cliff edge itself, at which the least implausible state, as shown, has a visible, but very weak positive AMOC cell in the deep Atlantic. The lower panel shows a location which is plausible even at standard values of the remaining parameters, where the least implausible inputs give a strong positive AMOC.

Fig. 6
figure 6

Zonally averaged Atlantic meridional overturning circulation (AMOC) in Sverdrups (1 Sv = 10⁶ m³ s⁻¹) for three simulations corresponding (in vertical order) to the least implausible inputs at the three points indicated as crosses in Fig. 5. Dashed lines correspond to negative values; latitude is in degrees

Note that Figs. 4 and 5b involve non-trivial computation, as the calculation of \(\hbox{Imp}(x_A)\) from (3) requires a numerical minimisation of the statistical model for Imp(x) over all the input dimensions not shown in the figures. The projection code divides the inputs into three types: the ones we are projecting onto, other active inputs, and remaining inputs. The ‘other active’ inputs are explicitly minimised over, while the remaining inputs are spanned with a space-filling design (the Sobol sequence, implemented in Würtz 2007). The minimum over the points in the space-filling design is taken to be the minimum over the whole input space. Therefore, our implausibility values are upper bounds, but sensitivity tests suggest that our results are relatively accurate.
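A sketch of this projection calculation for the pair (APM, AMD), with everything on the [−1, 1] model scale; 'active_other' and 'remaining' are assumed name vectors for the other active inputs and the rest, 'implausibility' is the combined model of Sect. 5.2, and a Latin hypercube stands in here for the Sobol sequence used in the paper:

```r
library(lhs)

imp_proj <- function(apm, amd, n_fill = 100) {
  fill <- as.data.frame(maximinLHS(n_fill, length(remaining)) * 2 - 1)
  names(fill) <- remaining

  vals <- apply(fill, 1, function(r) {
    # minimise implausibility over the other active inputs, with the
    # remaining inputs fixed at this space-filling point
    obj <- function(a) {
      x <- c(setNames(a, active_other), r, APM = apm, AMD = amd)
      implausibility(as.data.frame(as.list(x)))
    }
    optim(rep(0, length(active_other)), obj,
          method = "L-BFGS-B", lower = -1, upper = 1)$value
  })

  min(vals)   # an upper bound on the projected implausibility, as noted above
}
```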

7 Summary and discussion

Perturbed physics experiments (PPEs) allow us to account for our uncertainty about the values of the inputs to a complex model, such as an EMIC. Typically we express our uncertainty marginally, input-by-input, for example in terms of ranges and simple transformations, as we have done in our example (Table 1). A problem can arise in this type of experiment: the model’s solver might break down at some combinations of input values. Typically the solver will be tuned to perform well in the region centred on the model’s standard input values. It may also be robust against one-input perturbations, e.g. axial designs in which each input in turn is taken to its minimum and maximum values, with all other inputs at their standard values (see, e.g., Murphy et al. 2004), but it may break down when several inputs are varied simultaneously.

This is exactly the problem we faced with our Genie-I EMIC, where combinations of extreme (and even not-so-extreme) input values caused the model to fail to complete its spin-up. In this situation we can write a more robust solver (e.g. take smaller time-steps or make more fundamental changes to the model), or we can treat the solver failure as informative for the model. After investigation, we adopted the latter course, and classified as implausible those input values for which the solver failed (for this particular model setup). This was a particularly convenient choice in our analysis, but it is also a natural generalisation of the current practice of only running complex models at their standard input values, which amounts to treating all non-standard choices of the input values as implausible (i.e. not worth evaluating). Our approach is a generalisation because we treat the standard value as only one point within a set of not-implausible input values. Our approach is best understood from the standpoint of calibration, which attempts to find the ‘best input’ value x*. Precalibration is concerned with reducing the set of possible candidates for x*. In either case, we must begin by fixing a definition of our model and its solver, and deciding which parameters are available as inputs. These choices could be revisited if solver failure turns out to be a major issue, as indeed could the form of the parameterisations themselves. Indeed, learning about model parametric error would ideally constitute part of an iterative process to modify both solver design and model parameterisations. The treatment of non-completions is liable to be even more important in more expensive models, and alternative approaches could be envisaged, such as including timestep length as a variable parameter. In any case, it will be desirable to avoid excessive non-completions, which are largely wasted simulations.

One of the difficulties of PPEs is that it can be hard to specify our prior uncertainty about the best value of the model inputs. This is often because of difficulties with the operational definition of the model inputs, a problem that becomes more acute in lower resolution models. Ideally, we would have sufficient observations that, in a statistical calibration, our quantification of prior uncertainty would be relatively unimportant; we could then use wide intervals and simple distributional shapes (e.g., triangular, Beta, Gamma). Unfortunately, this is seldom the case with climate models, where the observations, though abundant, are strongly correlated, so that the likelihood function, i.e. the region of “good” inputs to the model, tends not to concentrate, but to have long ridges (Rougier 2007). Another problem is that this calibration requires us to quantify a measure of our model’s structural error: this is very challenging.

In this paper we have proposed a simpler version of calibration, which we term precalibration, based on the notion of implausibility (Craig et al. 1997). Precalibration is a low-cost way of ruling out input values that give rise to non-physical outcomes, and requires us only to specify what outcomes we deem to be non-physical. We use our ensemble to construct a statistical model that allows us to compute the probability that any particular input value will give rise to a non-physical outcome. It is important in this case that our ensemble explores the model’s input space in an efficient way, so that we get as much information as possible from our finite set of evaluations. In this paper we have used space-filling designs from the statistical field of Computer Experiments, and we have used sequential design to avoid evaluations likely to be non-physical.

The extent to which the physicality criteria are uncontroversial will, in practice, be a compromise against the extent to which the candidate region for x* is reduced. Tighter bounds of physicality imposed on the model output would, in general, reduce the size of the region of not-implausible inputs, but make the ruling-out process correspondingly more controversial. Similarly, although prior distributions for inputs are not required, narrower input ranges may lead to better resolution of the output space, but would imply more controversial a priori decisions. In principle, however, the objective is only to remove regions with zero probability (which implies that precalibration should not distort any subsequent calibration). This may be highly pertinent in probabilistic risk assessments which are driven by the tails, such that ‘almost implausible’ inputs are associated with high costs. Multiple iterations of precalibration (which may either increase or decrease the implausible region) could be highly valuable in such cases, since the exercise focuses implicitly on defining the edges of acceptable space. To proceed to a probabilistic risk analysis, however, requires explicit weighting of outcomes.

In our illustration with the Genie-I EMIC we have used implausibility to identify implausible choices for various selected inputs. In so doing we have generalised the analysis of Marsh et al. (2004), which considered APM and AMD only, and we have shown that, in our model, the existence of a cliff-edge catastrophe is robust to the inclusion of uncertainty about more model inputs, but that the location of the cliff-edge depends strongly on other parameters. It is worth stressing that our analysis uses an ensemble of model evaluations which is completely general, which is to say that many other questions can also be addressed using the same ensemble. Given that ensembles are expensive and time-consuming to generate, we would strongly recommend the use of statistical experimental design techniques to construct general-purpose ensembles. These can then be used to address specific questions using the techniques we have outlined here. As an example, Holden et al. (2009) apply precalibration to the estimation of glacial and future climate sensitivity and changes in terrestrial carbon storage. Their analysis demonstrates that the application of weak constraints on model inputs and outputs, even in two contrasting climate states, still allows for a wide range of predicted behaviour. For a detailed analysis, more statistically intensive approaches are also possible (see, e.g., O’Hagan 2006; Rougier and Sexton 2007; Rougier 2008). However, these require more specialised statistical input and more computing resources, whereas an initial exploratory analysis using implausibility is inexpensive and may often prove fruitful.