1 Introduction

When run beyond their deterministic predictability limit of about ten days, General Circulation Models (GCMs) can no longer be usefully interpreted in a deterministic sense; they are at least implicitly stochastic, and if they use stochastic parameterizations, they are explicitly so. In this “macroweather” regime, successive fluctuations tend to cancel each other out so that, in control run mode, each GCM converges ultra-slowly (Lovejoy et al. 2013) to its own climate. Assuming ergodicity, the control run climate is deterministic because it is the long-time average climate state, but the fluctuations about this state are stochastic.

Although each GCM climate may be different—and different from that of the real world—various studies [see e.g. the review in Lovejoy et al. (2018)] have indicated that the space–time statistics of the fluctuations about these model climates are realistic—that they are of roughly the same type as the fluctuations observed about the real climate state. For example, over wide ranges and with realistic exponents, they exhibit scaling in both space and time and, at least approximately, they obey a symmetry called “statistical space–time factorization” (Lovejoy and de Lima 2015) that relates space and time. This suggests that the main defect of GCMs is that their fluctuations are around unrealistic model climates.

Many different stochastic processes can yield identical statistics. This leads to the possibility—developed in Lovejoy et al. (2015)—that a simple model, having the same space–time statistical symmetries as the GCMs and the real world, could be used to directly model temperature fluctuations. If, in such a model, the long-term behaviour and the statistics of the fluctuations are forced to match those of past real-world data, the model combines realistic fluctuations with a realistic climate, leading to significantly improved forecasts. Indeed, using this ScaLIng Macroweather Model (SLIMM), Lovejoy (2015) gave some evidence for this by accurately forecasting the slow-down in the warming after 1998.

Starting with Hasselmann (1976), various stochastic macroweather and climate models have been proposed. Today, these approaches are generally known under the rubric Linear Inverse Modelling (LIM), e.g.: (Penland and Matrosova 1994; Penland and Sardeshmukh 1995; Winkler et al. 2001; Newman et al. 2003; Sardeshmukh and Sura 2009). However, they are all based on integer-order (stochastic) differential equations, which implicitly assume characteristic time scales associated with exponential decorrelation times; such models are not compatible with the observed scaling. To obtain models that respect the scaling symmetry, we may use fractional differential equations that involve strong, long-range memories; it is these long-range memories that are exploited in SLIMM forecasts. From a mathematical point of view, the fractional differential operators are of Weyl type (convolutions from the infinite past), so that they define not initial value problems, but rather past value problems.

In this paper we present the new Stochastic Seasonal to Interannual Prediction System (StocSIPS), which includes SLIMM as the core model for forecasting the natural variability component of the temperature field, but which also provides a more general framework for modelling the seasonality and the anthropogenic trend, with the possible inclusion of other atmospheric fields at different temporal and spatial resolutions. In this sense, StocSIPS is the general system and SLIMM is the main part of it, dedicated to the modelling of the stationary scaling series. The original technique that was used to make the SLIMM forecasts was basically correct, but it made several approximations (such as assuming that the amount of data available for the forecast was infinite) and it was numerically cumbersome. Here, for the development of StocSIPS, we return to it using improved mathematical and numerical techniques and validate them on ten different global temperature series since 1880 (five globally-averaged temperature series and five land surface average temperature series). We then compare hindcasts with Canada’s operational long-range forecast system, the Canadian Seasonal to Interannual Prediction System (CanSIPS), and we show that StocSIPS is just as accurate for 1-month forecasts, but significantly more accurate for longer lead times.

2 Theoretical framework

2.1 SLIMM

Since the work of Hasselmann (1976), there have been many stochastic climate theories based on the idea that the high-frequency weather drives the low-frequency climate as a stochastic forcing [for a review, see Franzke et al. (2014)]. The first and simplest approaches for solving the stochastic climate differential equations deduced from these theories were made through linear inverse models (LIM). The theoretical justification of LIM methods is based on extracting the intrinsic linear dynamics that govern the climatology of a complex system directly from observations of the system (an inverse approach). However, they implicitly assume exponential decorrelations in time, whereas both the underlying Navier–Stokes equations (and hence models, GCMs) and empirical analyses respect statistical scaling symmetries [see the review in Lovejoy and Schertzer (2013)]. Due to this lack of a solid physical basis, LIM approaches are referred to as “empirical approaches”. Nevertheless, their use is justified as a simpler alternative to the difficult task of improving numerical model parameterizations by appealing to physical arguments and first-principle reasoning alone.

The exponential decorrelations assumed by LIM models imply a scale break in time and—ignoring the diurnal and annual cycles—the only strong scale break is at the weather–macroweather transition scale of \(\tau_{w} \approx\) 5–15 days (varying slightly with location, especially latitude and land versus ocean, and also slightly from one atmospheric field to another). For the temperature, there is a transition in the spectrum at \(\omega \sim \omega_{w} \approx 1/\tau_{w}\), with two different asymptotic behaviors for very high and very low frequencies [see Fig. 4 in Lovejoy and Schertzer (2012)]. Empirically we find that \(E_{T} ( \omega)\sim \omega^{ - \beta }\) with \(\beta_{h} =\) 1.8 \((\omega > \omega_{w} )\) and \(\beta_{l} \approx\) 0.2–0.8 \(\left( {\omega < \omega_{w} } \right)\) (depending on the location). In contrast, the integer-order differential equation of the LIM model implies \(\beta_{h} =\) 2 and \(\beta_{l} =\) 0 (exactly, everywhere). Note that the empirical \(\beta_{h}\) is the value for a turbulent system: it corresponds to a highly intermittent process, not to a process that is close to the integral of white noise (i.e. an Ornstein–Uhlenbeck process). LIM’s exactly flat spectral behavior at low frequencies is a consequence of the fact that the highest-order differential term is integer ordered; it implies that the low frequencies are (unpredictable) white noise, so that for times much larger than the decorrelation time, temperature forecasts have no skill. LIM’s short-memory behavior can be modeled as a Markov process, or equivalently as an autoregressive or moving average process.

There are many empirical results showing a non-flat scaling behavior of the temperature spectrum (as well as of many other atmospheric variables), with values of \(\beta_{l}\) from 0.2 to 0.8 [see the review in Lovejoy and Schertzer (2013), also Lovejoy et al. (2018)]. This power-law behavior of the spectrum (and of the autocorrelation function) reflects the long-range memory that must be modelled. To appreciate the importance of the value of \(\beta_{l}\) for Gaussian processes: when \(\beta_{l} =\) 0 there is no predictability, whereas when \(\beta_{l} =\) 1 there is infinite predictability. The long-memory effects mean that the equations become non-Markovian and that past states must also be considered in order to predict the behavior of the system. The generalization of LIM’s integer-order differential equations to include fractional-order derivatives already introduces power-law correlations, the simplest option being to retain the Gaussian assumption about the noise forcing. This is the main idea behind the ScaLIng Macroweather Model (SLIMM) (Lovejoy et al. 2015).

In the macroweather regime, intermittency is generally low enough that a Gaussian model with long-range statistical dependency is a workable approximation [except perhaps for the extremes; see e.g. the review in Lovejoy et al. (2018)]. Some attempts have been made to use Gaussian models for prediction in the mean square prediction framework of autoregressive fractionally integrated moving average (ARFIMA) processes (Baillie and Chung 2002; Yuan et al. 2015). The theory behind some of these models only applies to stationary series, whereas, for example, globally-averaged temperature time series clearly show an increasing trend due to anthropogenic warming in recent decades. If the trend is not properly removed, the stationarity assumption no longer applies and the skill of the predictions is adversely affected. The ScaLIng Macroweather Model (SLIMM) (Lovejoy et al. 2015) was the first such model that took all these facts into consideration and offered a complete evaluation of the prediction skill based on hindcasts after removal of the anthropogenic warming component.

SLIMM is a model for the prediction of stationary series with Gaussian statistics and scaling symmetry of the fluctuations. It proposes a predictor that is a linear combination of past data (or past innovations). For Gaussian variables, it has been proven that this kind of linear predictor is optimal in the mean square error sense [see the “Fundamental note” on page 264 of Papoulis and Pillai (2002)]. That is, if any other functional form (i.e. a nonlinear one) is used to build a predictor based on past data, the mean square error of the predictions will be larger than with the linear combination. This is not necessarily true if the distribution of the variables is not Gaussian, for example in the case of multifractal processes, where the second moment statistics are not sufficient to describe the process.

Similarly to the spectrum, where \(E_{T} \left( \omega \right)\sim \omega^{ - \beta }\), in the macroweather regime the average of the fluctuations as a function of the time scale also presents a power-law (scaling) behavior, \(\left\langle {\Delta T\left( {\Delta t} \right)} \right\rangle \sim \Delta t^{H}\). Besides scale invariance, low intermittency (rough Gaussianity) in time is another characteristic of the macroweather regime. For Gaussian processes, the spectral and fluctuation exponents are related by \(H = \left( {\beta_{l} - 1} \right)/2\). In Lovejoy et al. (2015) SLIMM was introduced, based on fractional Gaussian noise (fGn), as the simplest stochastic model that includes both characteristics.

Because of their relevance to the current work, some properties of fGn presented in that paper are summarized here; for an extensive mathematical treatment see Biagini et al. (2008).

Over the range \(- 1 < H < 0\), an fGn process, \(G_{H} \left( t \right)\), is the solution of a fractional order stochastic differential equation of order \(H + 1/2\), driven by a unit Gaussian \(\delta\)-correlated white noise process, \(\gamma \left( t \right)\), (with \(\left\langle {\gamma \left( t \right)} \right\rangle = 0\) and \(\left\langle {\gamma \left( t \right)\gamma \left( {t'} \right)} \right\rangle = \delta \left( {t - t'} \right)\), where \(\delta \left( t \right)\) is the Dirac function):

$$\frac{d^{H + 1/2} G_{H} \left( t \right)}{dt^{H + 1/2}} = c_{H} \gamma \left( t \right),$$
(1)

where:

$$c_{H}^{2} = \frac{\pi }{{2\cos \left( {\pi H} \right)\varGamma \left( { - 2 - 2H} \right)}},$$
(2)

and \({{\varGamma }}\left( x \right)\) is the Euler gamma function. The value of the constant \(c_{H}\) is the standard one, chosen to make the expressions for the statistics particularly simple (see below). The fractional differential equation (Eq. (1)) was presented in Lovejoy et al. (2015) as a generalization of the LIM integer-order equation that accounts for the power-law behavior observed in the spectrum at frequencies \(\omega < \omega_{w} \approx 1/\tau_{w}\). Physically it could model a scaling heat storage mechanism.

Integrating Eq. (1), we obtain:

$$G_{H} \left( t \right) = \frac{c_{H}}{\varGamma \left( H + 1/2 \right)}\int\limits_{ - \infty }^{t} \left( t - t^{\prime} \right)^{ - \left( 1/2 - H \right)} \gamma \left( t^{\prime} \right)dt^{\prime}.$$
(3)

In other words, \(G_{H} \left( t \right)\) is the fractional integral of order \(H + 1/2\) of a white noise process, which can also be regarded as a smoothing of a white noise with a power-law filter. The process \(\gamma \left( t \right)\) is a particular case of \(G_{H} \left( t \right)\) for \(H = - 1/2\). Just as \(\gamma \left( t \right)\) is a generalized stochastic process (a distribution), the process \(G_{H} \left( t \right)\) is also a generalized function without point-wise values. It is the density of the well-known fractional Brownian motion (fBm) measures, \(B_{{H^{\prime}}} \left( t \right)\), with \(H^{\prime} = H + 1\), i.e. \(dB_{{H^{\prime}}} \left( t \right) = G_{H} \left( t \right)dt\) (Wiener integrals for the case \(H^{\prime} = 1/2\)). The derivative of a distribution (in this case \(B_{{H^{\prime}}} \left( t \right)\)) is formally defined from the following:

$$\int {\varphi \left( t \right)dB_{{H^{\prime}}} \left( t \right)} = \int {\varphi \left( t \right)G_{H} \left( t \right)dt} = - \int {\varphi^{\prime}\left( t \right)B_{{H^{\prime}}} \left( t \right)dt} ,$$
(4)

where \(\varphi \left( t \right)\) is a smooth test function for which the integrals exist.

From this relation to fBm, the resolution \(\tau\) (smallest sampling temporal scale) fGn process, \(G_{H,\tau } \left( t \right)\), can be defined, either as an average of \(G_{H} \left( t \right)\), or from the increments of the fBm process, \(B_{{H^{\prime}}} \left( t \right)\), at the same resolution:

$$G_{H,\tau } \left( t \right) = \frac{1}{\tau }\int\limits_{t - \tau }^{t} {G_{H} \left( {t^{\prime}} \right)dt^{\prime}} = \frac{1}{\tau }\int\limits_{t - \tau }^{t} {dB_{{H^{\prime}}} \left( {t^{\prime}} \right)} = \frac{1}{\tau }\left[ {B_{{H^{\prime}}} \left( t \right) - B_{{H^{\prime}}} \left( {t - \tau } \right)} \right].$$
(5)

In Lovejoy et al. (2015) it was shown that, for resolution \(\tau > \tau_{w}\), we can model the globally-averaged macroweather temperature as:

$$T_{\tau } \left( t \right) = \sigma_{T} G_{H,\tau } \left( t \right),$$
(6)

where \(- 1 < H < 0\) and \(\sigma_{T}\) is the temperature standard deviation at resolution \(\tau = 1\). The parameter \(H\), defined in this range, is not the more commonly used Hurst exponent of fBm processes, \(H^{\prime}\), but the fluctuation exponent of the corresponding fractional Gaussian noise process. Fluctuation exponents are used because of their wider generality; they are well defined even for strongly non-Gaussian processes. For a discussion see page 643 in Lovejoy et al. (2015).

Assuming \(\tau\) is the smallest scale in our system with the property \(\tau > \tau_{w}\) (e.g. \(\tau = 1\) month for air temperature), the temperature defined by Eq. (6) has the following properties:

$$\begin{aligned}&{\text{(i)}}\;T_{\tau } \left( t \right)\;{\text{is a Gaussian stationary process with continuous paths. }} \\&{\text{(ii)}}\;\left\langle {T_{\tau } \left( t \right)} \right\rangle = 0 \;{\text{and}}\;\left\langle {T_{\tau } \left( t \right)^{2} } \right\rangle = \sigma_{T}^{2} \tau^{2H}; \; {\text{for all }}t,{\text{ the notation}} \; \left\langle \cdot \right\rangle \\ &{\text{denotes ensemble (infinite realizations) averaging.}} \\& {\text{(iii)}}\;C_{{H,\sigma_{T} }} \left( {\Delta t} \right) = \left\langle {T_{\tau } \left( t \right)T_{\tau } \left( {t + \Delta t} \right)} \right\rangle \\ &=\frac{\sigma_{T}^{2}}{2\tau^{2}} \left( {\left| {\Delta t + \tau } \right|^{2H + 2} + \left| {\Delta t - \tau } \right|^{2H + 2} - 2\left| {\Delta t} \right|^{2H + 2} } \right);\\&{\text{ for}}\;\Delta t \ge\tau.\end{aligned}$$
(7)

For more details see Mandelbrot and Van Ness (1968), Gripenberg and Norros (1996) and Biagini et al. (2008).

From Eq. (7.iii), the behavior of the autocovariance function for \(\Delta t \gg \tau\) and \(- 1 < H < 0\) is:

$$C_{{H,\sigma_{T} }} \left( {\Delta t} \right) \approx \sigma_{T}^{2} \left( {H + 1} \right)\left( {2H + 1} \right)\Delta t^{2H}$$
(8)

and the corresponding spectrum for low frequencies is:

$$E_{T} \left( \omega \right) \approx \frac{\varGamma \left( 3 + 2H \right)\sin \left( \pi H \right)}{\sqrt{2\pi}}\,\omega^{ - \beta_{l} },$$
(9)

where \(\beta_{l} = 1 + 2H\).
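As a concrete check (our own sketch, not part of SLIMM; it assumes NumPy and uses function names introduced here only for illustration), the exact autocovariance of Eq. (7.iii) can be compared with its large-lag approximation, Eq. (8):

```python
import numpy as np

def fgn_autocovariance(dt, H, sigma_T=1.0, tau=1.0):
    """Exact fGn autocovariance C_{H,sigma_T}(dt) from Eq. (7.iii) at resolution tau."""
    dt = np.asarray(dt, dtype=float)
    return (sigma_T**2 / (2.0 * tau**2)) * (
        np.abs(dt + tau)**(2*H + 2)
        + np.abs(dt - tau)**(2*H + 2)
        - 2.0 * np.abs(dt)**(2*H + 2)
    )

def fgn_autocovariance_largelag(dt, H, sigma_T=1.0):
    """Power-law asymptote of Eq. (8): C ~ sigma_T^2 (H+1)(2H+1) dt^{2H}."""
    return sigma_T**2 * (H + 1) * (2*H + 1) * np.asarray(dt, dtype=float)**(2*H)

if __name__ == "__main__":
    H = -0.25
    for lag in (1, 2, 5, 10, 100):
        exact = fgn_autocovariance(lag, H)
        approx = fgn_autocovariance_largelag(lag, H)
        print(f"dt={lag:4d}  exact={exact:.4f}  asymptote={approx:.4f}")
    # The two converge for dt >> tau; at dt = tau they differ by roughly 10%.
```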

Combining Eqs. (3), (5) and (6), we get the following explicit integral expression for the temperature at resolution \(\tau\):

$$T_{\tau } \left( t \right) = \frac{1}{\tau }\frac{c_{H} \sigma_{T} }{\varGamma \left( H + 3/2 \right)}\left[ \int\limits_{ - \infty }^{t} \left( t - t^{\prime} \right)^{H + 1/2} \gamma \left( t^{\prime} \right)dt^{\prime} - \int\limits_{ - \infty }^{t - \tau } \left( t - \tau - t^{\prime} \right)^{H + 1/2} \gamma \left( t^{\prime} \right)dt^{\prime} \right].$$
(10)

Notice that \(T_{\tau } \left( t \right)\) is obtained from the difference of fractional integrals of order \(H + 3/2\) of a white noise process. Our definition of \(c_{H}\) in Eq. (2) implies that \(\left\langle {T_{\tau } \left( t \right)^{2} } \right\rangle = \sigma_{T}^{2} \tau^{2H}\). As \(H < 0\), it follows that, in the small-scale limit (\(\tau \to 0\)), the variance diverges and \(H\) is the scaling exponent of the root mean square (RMS) value. This singular small-scale behavior is responsible for the strong power-law resolution effects in fGn. For a detailed discussion on this important resolution effect that leads to a “space–time reduction factor” and its implications for the accuracy of global surface temperature datasets, see Lovejoy (2017).

Using the fact that \(T_{\tau } \left( t \right)\) is a Gaussian stationary process, Lovejoy et al. (2015) derived a formula for the predictor of the temperature at some time \(t \ge \tau\), given that data are available over the entire past (i.e. from \(t = - \infty\) to \(0\)). From Eq. (10), the mean square (MS) estimator for the temperature can be expressed as:

$$\hat{T}_{\tau } \left( t \right) = \frac{1}{\tau }\frac{c_{H} \sigma_{T} }{\varGamma \left( H + 3/2 \right)}\int\limits_{ - \infty }^{0} \left[ \left( t - t^{\prime} \right)^{H + 1/2} - \left( t - \tau - t^{\prime} \right)^{H + 1/2} \right]\gamma \left( t^{\prime} \right)dt^{\prime}.$$
(11)

As a measure of the skill of the model, we can use the mean square skill score (\({\text{MSSS}}\)), defined as:

$${\text{MSSS}}\left( {t,\tau } \right) = 1 - \frac{{\left\langle {\left[ {T_{\tau } \left( t \right) - \hat{T}_{\tau } \left( t \right)} \right]^{2} } \right\rangle }}{{\left\langle {T_{\tau } \left( t \right)^{2} } \right\rangle }},$$
(12)

i.e. one minus the normalized mean square error (\({\text{MSE}}\)). Here \(T_{\tau } \left( t \right)\) represents the verification and \(\hat{T}_{\tau } \left( t \right)\) the forecast at time \(t \ge \tau\). The reference forecast would be the average of the series \(\left\langle {T_{\tau } \left( t \right)} \right\rangle = 0\), for which the \({\text{MSE}}\) is the variance \(\left\langle {T_{\tau } \left( t \right)^{2} } \right\rangle\). Using Eqs. (10) and (11) in (12), an analytical expression for the \({\text{MSSS}}\) can be obtained:

$${\text{MSSS}}_{H} \left( t/\tau \right) = \frac{F_{H} \left( \infty \right) - F_{H} \left( t/\tau \right)}{F_{H} \left( \infty \right) + \tfrac{1}{2H + 2}},$$
(13)

where \(t \ge \tau\) and

$$F_{H} \left( t \right) = \int\limits_{0}^{t - 1} \left( \left( 1 + u \right)^{H + 1/2} - u^{H + 1/2} \right)^{2} du;$$
(14)

in particular,

$$F_{H} \left( \infty \right) = \frac{\varGamma \left( 3/2 + H \right)\varGamma \left( - 2H \right)}{\left( 2H + 2 \right)\varGamma \left( 1/2 - H \right)} - \frac{1}{2H + 2}.$$
(15)
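For reference, Eqs. (13)–(15) are straightforward to evaluate numerically; the short sketch below (ours, assuming NumPy and SciPy; the function names are introduced only for illustration) gives the theoretical skill of the continuous-in-time predictor when the infinite past is known:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def F_H(t, H):
    """F_H(t) from Eq. (14)."""
    integrand = lambda u: ((1.0 + u)**(H + 0.5) - u**(H + 0.5))**2
    value, _ = quad(integrand, 0.0, t - 1.0)
    return value

def F_H_inf(H):
    """Closed-form F_H(infinity) from Eq. (15)."""
    return (gamma(1.5 + H) * gamma(-2.0 * H)
            / ((2.0 * H + 2.0) * gamma(0.5 - H))) - 1.0 / (2.0 * H + 2.0)

def msss_continuous(t_over_tau, H):
    """Theoretical MSSS of Eq. (13) for a forecast horizon t/tau >= 1."""
    return (F_H_inf(H) - F_H(t_over_tau, H)) / (F_H_inf(H) + 1.0 / (2.0 * H + 2.0))

if __name__ == "__main__":
    for horizon in (1, 2, 3, 6, 12):
        print(horizon, round(msss_continuous(horizon, H=-0.25), 3))
```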

Although Eq. (11) is the formal expression for the predictor of the temperature, from a practical point of view it has two clear disadvantages: it is expressed as an integral of the unknown past innovations, \(\gamma \left( t \right)\), and it assumes the knowledge of these innovations for an infinite time in the past. It would be more natural to express the predictor as a function of the observed part of the process. This problem was solved for fBm processes with \(1/2 < H' < 1\) (equivalently \(- 1/2 < H < 0\)) by Gripenberg and Norros (1996). The explicit formula they found for the predictor, \(\hat{B}_{{H^{\prime},a}} \left( t \right)\), of the fBm process, \(B_{{H^{\prime}}} \left( t \right)\), known in the interval \(\left( { - a,0} \right)\) for \(t > 0\) and \(a > 0\), is:

$$\hat{B}_{H^{\prime},a} \left( t \right) = \int\limits_{ - a}^{0} g_{a} \left( t,t^{\prime} \right)B_{H^{\prime}} \left( t^{\prime} \right)dt^{\prime},$$
(16)

where \({\text{g}}_{a} \left( {t,t'} \right)\) is an appropriate weight function given by:

$$g_{a} \left( {t, - t^{\prime}} \right) = \frac{{\sin \left[ {\pi \left( {H^{\prime} - 1/2} \right)} \right]}}{\pi }\left[t^{{\prime}} \left( {a - t^{\prime}} \right)\right]^{{-H^{\prime} + 1/2}} \int\limits_{0}^{t} {\frac{{\left[x \left( {x + a} \right)\right]^{{H^{\prime} - 1/2}} }}{{x + t^{\prime}}}dx} .$$
(17)

It is important to note that the weight function goes to infinity both at the origin and at \(- a\) [see Fig. 8 in Norros (1995)]. In their words, this divergence when we approach \(- a\) is because “the closest witnesses to the unobserved past have special weight”.

The results summarized in Eqs. (10)–(17) are theoretically important, but, from the practical point of view of making predictions, a discrete representation of the process is needed. In the next sections, we present analogous results for the prediction of discrete-in-time, finite-past fGn processes and their application to the modelling and prediction of global temperature time series.

2.2 StocSIPS

The theory presented in the previous section and the applicability of SLIMM are restricted to detrended time series with Gaussian statistics and scaling behavior of the fluctuations. Real-world datasets, in particular raw temperature series, normally include periodic signals corresponding to the diurnal and seasonal cycles. They are also affected by an increasing trend in response to anthropogenic forcing and usually combine different scaling regimes depending on the temporal resolution used.

StocSIPS is the general system that includes SLIMM as the core model for the long-term prediction of atmospheric fields. In order to use SLIMM, some of the components of StocSIPS are dedicated to the “cleaning” of the original dataset. In particular, it includes techniques for removing and projecting the seasonality and the anthropogenic trend. It also degrades the temporal series to a scale where only one scaling regime with fluctuation exponent \(-1/2 < H < 0\) is present. The initial goal is to produce a temporal series that can be modelled and predicted with the stationary fGn process using the SLIMM theory. Some other aspects of StocSIPS—not discussed in this paper—include the addition of another space–time symmetry [the statistical space–time factorization (Lovejoy and de Lima 2015; Lovejoy et al. 2018)] for the regional prediction, and the combination as copredictors of different atmospheric fields.

One of the objectives of this paper is to show the improvements in the theoretical treatment and in the numerical methods of SLIMM as an essential part of StocSIPS. These recent developments have helped to produce faster and more accurate predictions of global temperature. The improvement in SLIMM and some of the preprocessing techniques are illustrated later on in Sect. 3 through an application to the forecast of globally-averaged temperature series.

2.2.1 Discrete-in-time fGn processes

As we showed in Sect. 2.1, for predicting the stationary component of the temperature at resolution \(\tau\) at a future time \(t > 0\), the linear predictor, \(\hat{T}_{\tau } \left( t \right)\), based on past data (\(T_{\tau } \left( s \right)\) for \(- a < s \le 0\)) and satisfying the minimum mean square error condition (orthogonality between the error and the data), can be written as:

$$\hat{T}_{\tau } \left( t \right) = \int\limits_{ - a}^{0} {M_{T} \left( {t,s} \right)T_{\tau } } \left( s \right)ds,$$
(18)

or equivalently, based on the past innovations, \(\gamma \left( s \right)\):

$$\hat{T}_{\tau } \left( t \right) = \int\limits_{ - a}^{0} {M_{\gamma } \left( {t,s} \right)\gamma } \left( s \right)ds,$$
(19)

where \(M_{T} \left( {t,s} \right)\) and \(M_{\gamma } \left( {t,s} \right)\) are appropriate weight functions. In SLIMM, the predictor given by Eq. (11) is the particular case of Eq. (19) with \(a = \infty\) and \(M_{\gamma } \left( {t,s} \right) = c_{H} \sigma_{T} \left[ {\left( {t - s} \right)^{H + 1/2} - \left( {t - \tau - s} \right)^{H + 1/2} } \right]/\left[ {\tau \varGamma \left( {H + 3/2} \right)} \right],\) while the solution of Gripenberg and Norros (1996) (Eq. (16) here) is the case of Eq. (18) for an fBm process with \(M_{T} \left( {t,s} \right)\) analogous to \(g_{a} \left( {t,t'} \right)\) given by Eq. (17).

The mathematical theory presented in Sect. 2.1 is general for a continuous-in-time fGn. Moreover, the integral representation of fGn given by Eq. (10) is based on an infinite past of continuous innovations, \(\gamma \left( t \right)\). For applications to real-world data, a discrete version of the problem is needed for the case of fGn with finite past data (\(a < \infty\)). In practice, for the temperature (and any other atmospheric field) we only have measurements at discrete times, at some finite resolution and over a limited period. For modeling these fields, we can therefore consider a discrete-in-time fGn process as a more suitable model.

Assuming that we have already removed the low-frequency anthropogenic component of the temperature series (see Sect. 3.2), in the discrete case, we could express the zero mean detrended component by its moving average (MA(\(\infty\))) stochastic representation given by the Wold representation theorem (Wold 1938):

$$T_{t} = \sum\limits_{j = - \infty }^{t} {\varphi_{t - j} \gamma_{j} } ,$$
(20)

where \(\left\{ {\varphi_{t} } \right\}\) are weight parameters with units of temperature and \(\left\{ {\gamma_{t} } \right\}\) is a white noise sequence with \(\gamma_{t} \sim {\text{NID}}\left( {0,1} \right)\) and \(\left \langle \gamma_i \gamma_j \right \rangle=\delta_{i j}\), where \(\delta_{ij}\) is the Kronecker delta and \({\text{NID}}\left( {\mu ,\sigma^{2} } \right)\) stands for normally and independently distributed with mean \(\mu\) and variance \(\sigma^{2}\) (the sign \(\sim\) means “distributed as”). This equation is analogous to Eq. (10) for the continuous case.

By inverting Eq. (20) we can obtain the equivalent autoregressive (AR(\(\infty\))) representation (Palma 2007):

$$T_{t} = \sigma_{0} \gamma_{t} + \sum\limits_{j = - \infty }^{t - 1} {\pi_{t - j} T_{j} } ,$$
(21)

which is more suitable for predictions, as any value of the series is given as a linear combination of the values in the past. In this representation the weights \(\left\{ {\pi_{t} } \right\}\) are unitless.

In practice, we only have a finite stretch of data \(\left\{ {T_{ - t} , \ldots ,T_{0} } \right\}\). Under this circumstance, the optimal k-step Wiener predictor for \(T_{k}\) (\(k > 0\)), based on the finite past, is given by:

$$\hat{T}_{t} \left( k \right) = \sum\limits_{j = - t}^{0} {\phi_{t,j} \left( k \right)T_{j} } = \phi_{t, - t} \left( k \right)T_{ - t} + \cdots + \phi_{t,0} \left( k \right)T_{0} ,$$
(22)

where the new vector of coefficients, \({\mathbf{\phi }}_{{{\kern 1pt} t}} \left( k \right) = \left[ {\phi_{t, - t} \left( k \right), \ldots ,\phi_{t,0} \left( k \right)} \right]^{T}\) (the superscript \(T\) denotes the transpose), satisfies the Yule-Walker equations [see page 96 in Hipel and McLeod (1994)]:

$${\mathbf{R}}_{{H,\sigma _{T} }}^{t} {\mathbf{\phi }}_{{{\kern 1pt} t}} \left( k \right) = {\mathbf{C}}_{{H,\sigma _{T} }}^{t} \left( k \right),$$
(23)

with \({\mathbf{C}}_{{H,\sigma_{T} }}^{t} \left( k \right) = \left[ {C_{{H,\sigma_{T} }} \left( {k - i} \right) } \right]_{i = - t, \ldots ,0}^{T} = \left[ {C_{{H,\sigma_{T} }} \left( {t + k} \right), \ldots ,C_{{H,\sigma_{T} }} \left( k \right) } \right]^{T}\) and \({\mathbf{R}}_{{H,\sigma_{T} }}^{t} = \left[ {C_{{H,\sigma_{T} }} \left( {i - j} \right)} \right]_{i,j = - t, \ldots ,0}\) being the autocovariance matrix. The elements \(C_{{H,\sigma_{T} }} \left( {\Delta t} \right)\) are obtained from Eq. (7.iii) where we assume \(\tau = 1\) is the smallest scale in our system with the property \(\tau \gg \tau_{w}\) (e.g. \(\tau = 1\) month).
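A minimal sketch of this step (ours, assuming NumPy; the helper names are not from StocSIPS): build the autocorrelation matrix from Eq. (7.iii) with \(\tau = 1\) and solve the Yule-Walker system, Eq. (23), for the k-step predictor coefficients; since \(\sigma_{T}\) cancels, the coefficients depend only on \(H\):

```python
import numpy as np

def fgn_corr(lag, H):
    """Normalized fGn autocovariance, Eq. (7.iii) with tau = 1 and sigma_T = 1."""
    lag = np.abs(np.asarray(lag, dtype=float))
    return 0.5 * ((lag + 1)**(2*H + 2) + np.abs(lag - 1)**(2*H + 2) - 2*lag**(2*H + 2))

def predictor_coefficients(m, k, H):
    """Solve R phi = C (Eq. 23) for a k-step forecast using the m+1 past values
    T_{-m}, ..., T_0; returns phi ordered from the most ancient to the most recent."""
    past = np.arange(-m, 1)
    R = fgn_corr(past[:, None] - past[None, :], H)   # autocorrelation matrix
    C = fgn_corr(k - past, H)                        # correlations with the target T_k
    return np.linalg.solve(R, C)

if __name__ == "__main__":
    phi = predictor_coefficients(m=35, k=12, H=-0.25)
    T_past = np.random.randn(36)          # stand-in for the detrended T_{-35}, ..., T_0
    T_hat = phi @ T_past                  # k-step forecast, Eq. (22)
    print(phi[:3].round(3), phi[-3:].round(3), round(T_hat, 3))
```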

Notice that the coefficients \(\left\{ {\phi_{t,j} } \right\}\) only depend on \(H\) [\(\sigma_{T}\) cancels on both sides of Eq. (23)] and, further, that they are not the same as the coefficients \(\left\{ {\pi_{t} } \right\}\), for which complete knowledge of the infinite past is assumed. The coefficients \(\left\{ {\pi_{t} } \right\}\) decrease monotonically as we go further into the past, while this is not the case for the coefficients \(\left\{ {\phi_{t,j} } \right\}\), as can be seen in Fig. 1 for the cases \(H = -\, 0.1, -\, 0.25, -\, 0.4\), with \(k = 12\) steps predicted into the future using a series of \(t + 1 = 36\) values. Notice how the memory effect (the weight of the coefficients) increases with the value of \(H\). This behavior of the coefficients is analogous to that mentioned earlier for the function \(g_{a} \left( {t,t'} \right)\) (Eq. (17)). As found in Gripenberg and Norros (1996) for the continuous-in-time case, not only is there a strong weighting of the most recent data, but the most ancient available data also have singular weights [compare Fig. 1 here with Fig. 3.1 in (Gripenberg and Norros 1996)].

Fig. 1

Optimal coefficients, \(\phi_{t,j}\), from Eq. (22) with \(H = - 0.1, - 0.25, - 0.4\) (top to bottom) for predicting \(k = 12\) steps in the future using the data for \(j = - 35, \ldots ,0\) in the past. Notice the strong weighting of both the most recent (right) and the most ancient available data (left), and how the memory effect increases with the value of \(H\). Compare to Fig. 3.1 in Gripenberg and Norros (1996)

This behavior of the coefficients for fGn is the main difference (and a clear advantage) with respect to other autoregressive models (AR, ARMA), which do not include fractional integrations accounting for the long-term memory and do not consider the information from the distant past. An additional limitation of those approaches is that, for each \(\Delta t\), the values of \(C\left( {\Delta t} \right) = \left\langle T_{\tau } \left( t \right)T_{\tau } \left( {t + \Delta t} \right) \right\rangle\) must be estimated directly from the data. Each \(C\left( {\Delta t} \right)\) has its own error; this effectively introduces a large “noise” into the predictor estimates. In addition, it is computationally expensive if a large number of coefficients is needed. In our fGn model the coefficients have an analytic expression which depends only on the fluctuation exponent, \(H\), obtained directly from the data by exploiting the scale-invariance symmetry of the fluctuations; our problem is thus a statistically highly constrained problem of parametric estimation (\(H\)), not an unconstrained one (the entire \(C\left( {\Delta t} \right)\) function).

In the discrete case, the mean square skill score, defined by Eq. (12), has the following analytical expression:

$${\text{MSSS}}_{H}^{t} \left( k \right) = {\tilde{\mathbf{C}}}_{H}^{t} \left( k \right)^{T} \left( {{\tilde{\mathbf{R}}}_{H}^{t} } \right)^{ - 1} {\tilde{\mathbf{C}}}_{H}^{t} \left( k \right),$$
(24)

where \({\tilde{\mathbf{C}}}_{H}^{t} \left( k \right) = \left[ {\tilde{C}_{H} \left( {k - i} \right)} \right]_{i = - t, \ldots ,0}^{T}\) is a vector formed by the autocorrelation function \(\tilde{C}_{H} \left( {\Delta t} \right) = C_{{H,\sigma_{T} }} \left( {\Delta t} \right)/\sigma_{T}^{2}\) (see Eq. (7.iii)) and \({\tilde{\mathbf{R}}}_{H}^{t} = {\mathbf{R}}_{{H,\sigma_{T} }}^{t} /\sigma_{T}^{2} = \left[ {\tilde{C}_{H} \left( {i - j} \right)} \right]_{i,j = - t, \ldots ,0}\) is the autocorrelation matrix. For a given horizon in the future, \(k\), the \({\text{MSSS}}\) will only depend on the exponent, \(H\), and the extension of our series in the past, \(t\).

In the previous equations, the full length of our known series was \(t + 1\), but we do not necessarily have to use the complete series to build our predictor. It is enough to use a number \(m + 1\) of points in the past (the memory), with \(m < t\). The new predictor and skill score are obtained by simply replacing \(t\) by \(m\) in Eqs. (22)–(24). By doing this, we can use the remaining \(t - m - 1\) points for hindcast verification.
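To illustrate, the same linear algebra gives Eq. (24) for a finite memory \(m\); the sketch below (ours, assuming NumPy, and reusing the normalized autocovariance of Eq. (7.iii)) shows how the skill grows with the memory used:

```python
import numpy as np

def fgn_corr(lag, H):
    """Normalized fGn autocovariance, Eq. (7.iii) with tau = 1 and sigma_T = 1."""
    lag = np.abs(np.asarray(lag, dtype=float))
    return 0.5 * ((lag + 1)**(2*H + 2) + np.abs(lag - 1)**(2*H + 2) - 2*lag**(2*H + 2))

def msss_discrete(m, k, H):
    """MSSS of Eq. (24) for a k-step forecast using a memory of m+1 past values."""
    past = np.arange(-m, 1)
    R = fgn_corr(past[:, None] - past[None, :], H)
    C = fgn_corr(k - past, H)
    return C @ np.linalg.solve(R, C)

if __name__ == "__main__":
    H, k = -0.25, 3
    reference = msss_discrete(500, k, H)          # proxy for the asymptotic skill
    for m in (5, 10, 22, 50, 100):
        skill = msss_discrete(m, k, H)
        print(f"m={m:3d}  MSSS={skill:.3f}  fraction of m=500 value={skill/reference:.2f}")
```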

For the case where \(H = -\, 0.25\) and \(k = 3\), Fig. 2 shows how the \({\text{MSSS}}\) approaches the asymptotic value corresponding to an infinite past as the amount of memory used increases. The dashed line represents the \({\text{MSSS}}\) for \(m = 500\) and the dotted line is the value obtained using Eq. (13) for the continuous-in-time case with the infinite past known. The difference between the two is not due to the finite memory (\(m = 500\)) of the discrete case compared with the infinite past assumed in Eq. (13), but to intrinsic differences arising from the discretization, mainly the high-frequency information lost in the smoothing from a continuous to a discrete process. Note that we do not need to use a large memory to achieve a skill close to the asymptotic value. In this example with \(H = -\, 0.25\), we only need \(m \ge 22\) for \(k = 3\) to get more than 95% of the maximum skill.

Fig. 2

\({\text{MSSS}}_{H}^{m} \left( k \right)\) as a function of the memory, \(m\), for the case where \(H = - \,0.25\) and \(k = 3\). The dashed line represents the \({\text{MSSS}}\) for \(m = 500\) and the dotted line is the value obtained with Eq. (13) for the continuous-in-time case. For \(m = 22\), more than 95% of the asymptotic skill is achieved

The amount of memory needed depends on the value of \(H\), as we can see in Fig. 3, where we plot the minimum memory, \(m_{95\% }\), needed to get more than 95% of the asymptotic value (corresponding to \(m = \infty\)) as a function of the horizon, \(k\), for different values of \(H\). The line \(m = 15k\) was also included for reference. The larger the value of the exponent \(H\) (the closer to zero), the less memory we need to approach the maximum possible skill. This fact seems counterintuitive, but the explanation is simple: for larger values of \(H\) (closer to zero), the influence of values farther in the past is stronger, but at the same time, more of the information from those values is already contained in the recent past, so less memory is needed for forecasting. This is in line with the rule of thumb found by Norros (1995) for the continuous case: “one should predict (…) the next second with the latest second, the next minute with the latest minute, etc.” From Fig. 3 we can conclude that, for predicting \(k\) steps into the future, a memory \(m = 15k\) is a safe minimum value for achieving almost the maximum possible skill for any value of \(H\) in the range \(\left( {-1/2, 0} \right)\), which is the case for temperature and many other atmospheric fields. Of course, if \(H\) is close to zero a much smaller value could be taken. The approximate ratio \(m_{95\% } /k\) for each \(H\) is included at the top of the respective curve. From the point of view of the availability of data for the predictions, this result is important: once the value of \(H\) is estimated, and assuming it remains stable in the future, we only need a few recent data points to forecast the future temperature. The information in the unknown data from the distant past is automatically taken into account by the model.

Fig. 3

Minimum memory, \(m\), needed to get more than 95% of the asymptotic value (corresponding to \(m = \infty\)) as a function of the horizon, \(k\), for different values of \(H\). The larger the value of \(H\) (the closer to zero) the less memory is needed for a given horizon. The approximate ratio \(m_{95\% } /k\) for each \(H\) was included at the top of the respective curve

Previously, we showed that an fGn process is fully characterized by its autocovariance function, which in turn depends only on the variance, \(\sigma_{T}^{2}\), and the fluctuation exponent \(H\). To extend our description to more general cases, we can allow our series to have a non-zero ensemble mean, \(\mu\). These three parameters define our fGn process and represent the link between the mathematical model and real-world historical data.

In Appendix 1 we discuss how to obtain maximum likelihood estimates (MLE) for these parameters on a given time series. For the fluctuation exponent, we show other approximate (and less computationally expensive) methods. We can use Eq. (9) to obtain \(\hat{H}_{s} = \left( {\beta _{l} - 1} \right)/2\) from the spectrum exponent at low frequencies. This method, as well as the Haar wavelet analysis to obtain an estimate \(\hat{H}_{h}\) from the exponent of the Haar fluctuations, was used in Lovejoy and Schertzer (2013) and Lovejoy et al. (2015) to obtain estimates of \(H\) for average global and Northern Hemisphere anomalies. A Quasi Maximum Likelihood Estimate (QMLE) method is also discussed in Appendix 1. The latter is more accurate than the Haar fluctuations and the spectral analysis methods and is obtained as part of the hindcast verification process. Nevertheless, those two have the advantage of being more general and applicable to any scaling process (even highly nonGaussian ones).

All these methods were applied to fGn simulations and the estimated parameters are summarized in Table 4. The technical details for producing exact simulations are also discussed in Appendix 1. Finally, we show how to check the adequacy of the fitted fGn model to real-world data and we derive some ergodic properties of fGn processes. Specifically, we show that the squared temporal (sample) standard deviation, \(SD_{T}^{2} = \mathop \sum \nolimits_{t = 1}^{N} \left( {T_{t} - \bar{T}_{N} } \right)^{2} /N\), is a strongly biased estimate of the variance of the process, \(\sigma_{T}^{2}\), for values of \(H\) close to zero (the overbar denotes temporal averaging: \(\bar{T}_{N} = \mathop \sum \nolimits_{t = 1}^{N} T_{t} /N\)). The sample and the ensemble estimates are related by:

$$SD_{T}^{2} = \sigma_{T}^{2} \left( {1 - N^{2H} } \right).$$
(25)

For \(H = -\, 0.06\) and \(N = 1656\) (the values for the monthly series since 1880), there is a large difference between the sample and the ensemble estimates (\(SD_{T}^{2} /\sigma_{T}^{2} = 0.59\)). Some skill scores [e.g. the \({\text{MSSS}}\) or the normalized mean squared error (\({\text{NMSE}}\))] use the variance for normalization. The implications of this difference in the estimates of the variance for the definition of the \({\text{MSSS}}\) will be discussed in Sect. 3.4.3.
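A quick arithmetic check of Eq. (25) (our sketch, assuming NumPy) reproduces this ratio:

```python
import numpy as np

def variance_ratio(N, H):
    """SD_T^2 / sigma_T^2 implied by Eq. (25) for a sample of length N."""
    return 1.0 - N**(2.0 * H)

if __name__ == "__main__":
    print(round(variance_ratio(1656, -0.06), 2))   # ~0.59: the 1880-2017 monthly case
    print(round(variance_ratio(1656, -0.25), 2))   # much closer to 1 for lower H
```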

3 Forecasting global temperature anomalies

3.1 The data

The general framework presented here is applicable to forecasting any time series that satisfies (a) stationarity, (b) Gaussianity and (c) long-range dependence, given by power-law behavior of the correlation function with fluctuation exponent in the range \(\left( { - 1/2, 0} \right)\). These three properties are well satisfied by globally-averaged temperature anomaly time series in the macroweather regime, from 10 days to some decades (Lovejoy and Schertzer 2013; Lovejoy et al. 2013, 2015). In the last three decades, a growing literature has shown that the temperature (and other atmospheric fields) is scaling in the macroweather regime (Koscielny-Bunde et al. 1998; Blender et al. 2006; Huybers and Curry 2006; Franzke 2012; Rypdal et al. 2013; Yuan et al. 2015); see also the extensive review in Lovejoy and Schertzer (2013). Strictly speaking, in the last century the low frequencies became dominated by anthropogenic effects, and after 10–20 years the scaling regime changes from a negative to a positive value of \(H\), as we will show below. As discussed in detail in Lovejoy (2014, 2017) and Lovejoy et al. (2015), unlike preindustrial epochs, recent temperature time series can be modeled by a trend-stationary process, i.e. a stochastic process from which an underlying trend (a function solely of time) can be removed, leaving a stationary process. In other words, to first order, the variability is unaffected by climate change. The deterministic trend representing the response to external forcings can be removed by using CO2 radiative forcing as a good linear proxy for all the anthropogenic effects [or equivalent-CO2 (CO2eq) radiative forcing such as that used for the CMIP5 simulations (Meinshausen et al. 2011)]. There is a nearly linear relation between the actual CO2 concentration and the estimated equivalent concentration which includes all anthropogenic forcings, including greenhouse gases, aerosols, etc. (Meinshausen et al. 2011).

In this paper, we limit our analysis to globally-averaged temperature anomaly time series at monthly resolution. This is a first step for checking the applicability of the model while at the same time providing an alternative method for obtaining long-term forecasts. The quality of our method can be assessed from the skill obtained in hindcast verification and its agreement with the theoretical prediction.

There are five major observation-based global temperature datasets in common use: (a) the NASA Goddard Institute for Space Studies Surface Temperature Analysis (GISTEMP) series, abbreviated NASA and NASA-L in the following for the global and land surface averages respectively (Hansen et al. 2010; GISTEMP Team 2018), (b) the NOAA NCEI series, GHCN-M version 3.3.0 plus the ERSST dataset (Smith et al. 2008; NOAA-NCEI 2018), updated in Gleason et al. (2015), abbreviated NOAA and NOAA-L (global and land surface averages, as before), (c) the combined land and sea surface temperature (SST) anomalies from CRUTEM4 and HadSST3, Hadley Centre–Climatic Research Unit Version 4, abbreviated HAD4 and HAD4-L (Morice et al. 2012; Met Office Hadley Centre 2018), (d) the version 2 series of Cowtan and Way (2014, 2018), abbreviated CowW and CowW-L, and (e) the Berkeley Earth series (Rohde et al. 2013; Berkeley Earth 2018), abbreviated Berk and Berk-L. The averages of the five global and of the five land surface series were also included in the analysis, abbreviated Mean-G and Mean-L, respectively.

All these series are of anomalies, i.e. the difference between temperature at a given time and the average during a baseline period. They tend not to be on the same baseline; for NASA and Berk the reference period is 1951–1980, for HAD4 and CowW it is 1961–1990, and for NOAA it is the 20th century (1901–2000). To compare them, we need to use the same zero point. In this case we chose the 20th century average as a common reference period. The average temperature for 1901–2000 is nearly the same as that for 1951–1980, while that of more recent times (1961–1990) is warmer.

Each series spans a somewhat different period: HAD4, CowW and Berk start first, beginning in 1850, while NASA and NOAA both start in 1880. When the data were accessed on May 21, 2018, they were all available at monthly resolution up to April 2018. Only the period January 1880–December 2017 was analyzed, i.e. 138 years = 1656 months (the same length as was used in the simulations in Appendix 1). These series (updated until 2012), together with the twentieth century reanalysis global average, were used in Lovejoy (2017) to assess how accurate the data are as a function of time scale. As pointed out there, each data set has its strengths and weaknesses, and it is precisely their degree of agreement or disagreement that permits us to evaluate the intrinsic absolute uncertainty in the estimates of the global temperature.

In Fig. 4 we show the global average temperature (bottom) and the land surface average temperature (top). In red are the means of the five datasets for global and for land, respectively, and in blue is a measure of their dispersion, given by the standard deviation of the five series as a function of time. The datasets are most dissimilar before 1900, which could be due to the lack of reliable measurements, but otherwise the overall level of agreement is very good [about ± 0.05 °C, nearly independent of scale for the global temperature series (Lovejoy 2017)]. Each series shows warming during the last decades, and they all show fluctuations superimposed on the warming trend.

Fig. 4

Monthly surface temperature anomaly series from 1880 to 2017. In red is the mean of the five datasets for global (bottom): NASA, NOAA, HAD4, CowW, and Berk, and for land (top): NASA-L, NOAA-L, HAD4-L, CowW-L, and Berk-L. The dispersion among the series—given by the standard deviations of the five series as a function of time—is shown in blue. Each series represents the anomaly with respect to the mean of the reference period 1901–2000

3.2 Removing the anthropogenic component

In the present case of globally-averaged temperatures, the seasonality in the time series is weak. The deterministic annual cycle component was removed first from the original series; it was estimated from the average of each calendar month over the full 138-year period (1880–2017). Cross-validation effects are weak for such a long reference period and were not considered.
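A minimal sketch of this deseasonalization (ours; it assumes NumPy and a hypothetical monthly array `T` beginning in January):

```python
import numpy as np

def remove_annual_cycle(T):
    """Subtract the mean of each calendar month computed over the whole record."""
    T = np.asarray(T, dtype=float)
    month = np.arange(len(T)) % 12               # 0 = January, assuming a January start
    climatology = np.array([T[month == m].mean() for m in range(12)])
    return T - climatology[month], climatology

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    t = np.arange(1656)                          # 138 years of monthly values
    demo = 0.5 * np.sin(2 * np.pi * t / 12) + 0.1 * rng.standard_normal(t.size)
    anomalies, clim = remove_annual_cycle(demo)
    print(clim.round(2))                         # recovered monthly climatology
```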

Because of the anthropogenically induced trends in addition to internal macroweather variability, global temperature time series have low-frequency forced variability. A simple application of the linearity of the climate response to external forcings yields:

$$T\left( t \right) = T_{\text{anth}} \left( t \right) + T_{\text{nat}} \left( t \right),$$
(26)

which considers the temperature as a combination of a purely deterministic response to anthropogenic forcings, \(T_{\text{anth}}\), plus a strict stationary stochastic component, \(T_{\text{nat}}\), with zero mean. The low frequency component can be obtained as:

$$T_{\text{anth}} \left( t \right) = \lambda_{2 \times {\text{CO}}_{2}{\text{eq}}} \log_{2} \left[ \rho_{{\text{CO}}_{2}{\text{eq}}} \left( t \right) / \rho_{{\text{CO}}_{2}{\text{eq,pre}}} \right] + T_{0},$$
(27)

where \(\rho_{{{\text{CO}}_{2} {\text{eq}}}}\) is the observed globally-averaged equivalent-CO2 concentration with preindustrial value \(\rho_{{{\text{CO}}_{2} {\text{eq}},{\text{pre}}}} = 277 \,{\text{ppm}}\) and \(\lambda_{{2 \times {\text{CO}}_{2} {\text{eq}}}}\) is the transient climate sensitivity (which excludes delayed responses) associated with a doubling of the atmospheric equivalent-CO2 concentration. For \(\rho_{{{\text{CO}}_{2} {\text{eq}}}}\) we used the CMIP5 simulation values (Meinshausen et al. 2011). The definition of CO2eq here includes not only greenhouse gases, but also aerosols, with their corresponding cooling effect. The reference value \(T_{0}\) is chosen so that \(\bar{T}_{\text{nat}} = 0\) (the overbar indicates temporal averaging). The parameters \(\lambda_{{2 \times {\text{CO}}_{2} {\text{eq}}}}\) and \(T_{0}\) are estimated from the linear regression of \(T\left( t \right)\) vs. \(\log_{2} \left[ {\rho_{{{\text{CO}}_{2} {\text{eq}}}} \left( t \right)/\rho_{{{\text{CO}}_{2} {\text{eq}},{\text{pre}}}} } \right]\). The residuals are the stochastic natural variability component, \(T_{\text{nat}}\).
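This detrending step is an ordinary linear regression in the variable \(\log_{2} \left[ {\rho_{{{\text{CO}}_{2} {\text{eq}}}} \left( t \right)/\rho_{{{\text{CO}}_{2} {\text{eq}},{\text{pre}}}} } \right]\). A sketch (ours, assuming NumPy; the arrays `T` and `rho_co2eq` are hypothetical stand-ins for the anomaly series and the CMIP5 CO2eq record):

```python
import numpy as np

RHO_PRE = 277.0   # preindustrial CO2eq concentration, ppm

def split_anthropogenic(T, rho_co2eq):
    """Fit T = lambda_2xCO2eq * log2(rho/rho_pre) + T0 (Eq. 27); residuals give T_nat."""
    x = np.log2(np.asarray(rho_co2eq, dtype=float) / RHO_PRE)
    T = np.asarray(T, dtype=float)
    lam, T0 = np.polyfit(x, T, 1)            # least-squares slope and intercept
    T_anth = lam * x + T0
    return lam, T0, T_anth, T - T_anth

if __name__ == "__main__":
    # Synthetic stand-ins; real use would load the Mean-G anomalies and the CO2eq record.
    rng = np.random.default_rng(1)
    t = np.arange(1656)
    rho = 290.0 * np.exp(t / 3000.0)
    T = 2.0 * np.log2(rho / RHO_PRE) - 0.4 + 0.2 * rng.standard_normal(t.size)
    lam, T0, T_anth, T_nat = split_anthropogenic(T, rho)
    print(round(lam, 2), round(T0, 2), round(float(T_nat.mean()), 3))
```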

The natural variability includes “internal” variability and the response of the system to natural forcings: solar and volcanic. There is no gain in trying to model the responses to these two natural forcings independently: they would represent unpredictable signals, whereas \(T_{\text{nat}}\) as a whole can be directly modelled using the techniques discussed in Sect. 2 for fGn processes. In experiments where we tried to predict the internal variability and the solar and volcanic responses independently, the combined error was larger than when we forecast the natural variability component as a whole. On the other hand, the relatively smooth time dependence of the anthropogenic component makes it easy to project it a few years into the future with good accuracy.

As an example, the temperature anomalies for the global average dataset (Mean-G) are shown in Fig. 5 (red in the online version) together with the CO2eq response to anthropogenic forcings (dashed, black) and the residual natural variability component (blue). Using CO2 instead of CO2eq forcing leads to almost the same residuals because of the nearly linear relation between the two, while avoiding the uncertainties in estimating the cooling effect of the aerosols as well as other radiative assumptions; the CO2 forcing is then taken as a surrogate for all the anthropogenic forcings. The focus of this work is to model and forecast the residuals (natural variability), and for that purpose, either of the two concentrations leads to essentially the same residuals (they differ by a factor of 1.12 over the last century). From a direct inspection of Fig. 5, it is clear that a CO2eq response does a much better job of reproducing the actual trend of the temperature series than a simple regression linear in time, which is often used for estimating the warming trend.

Fig. 5

Temperature anomalies for the Mean-G dataset (red in the online version) together with the CO2eq trend (dashed, black) and the residual natural variability component (blue)

Before making predictions, we need to verify the adequacy of the model and verify the hypothesis that the residual natural variability component has scaling fluctuations with exponent in the range \(\left( {{-}1/2,0} \right)\). The Haar fluctuation analysis for the Mean-G (bottom) and Mean-L (top) datasets before and after removing the anthropogenic trends are shown in Fig. 6 (red for the raw dataset fluctuations and blue for the detrended series in the online version). The reference lines with slopes \(H_{h} = - \,0.078 \pm 0.023\) for the global series and \(H_{h} = - \,0.200 \pm 0.021\) for the land surface series were obtained from regression of the residuals’ fluctuations between 2 months and 60 years. The points corresponding to scales of more than 60 years were not considered for estimating the parameters as there were not many fluctuations to average at those time scales. In addition, some of the low frequency natural variability was presumably removed with the forced variability. The units for \(\Delta t\) and \(\Delta T\) are months and °C, respectively. Notice that the anthropogenic warming breaks the scaling of the fluctuations at a time scale of around 10 years (the red and blue curves diverge at ~ 100 months). The residual natural variability, on the other hand, shows reasonably good scaling for the whole period analyzed (138 years). The same range of scaling with decreasing fluctuations has been obtained in temperature records from preindustrial multiproxies and GCMs preindustrial control runs (Lovejoy 2014).
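The Haar estimates quoted above can be reproduced with a short script; the sketch below (ours, assuming NumPy) uses one common convention for the Haar fluctuation, the difference between the means of the second and first halves of disjoint intervals of length \(\Delta t\):

```python
import numpy as np

def haar_fluctuations(T, lags):
    """Mean absolute Haar fluctuation <|dT(dt)|> for each (even) lag dt."""
    T = np.asarray(T, dtype=float)
    out = []
    for dt in lags:
        half = dt // 2
        usable = (len(T) // dt) * dt
        blocks = T[:usable].reshape(-1, dt)
        fluct = blocks[:, half:].mean(axis=1) - blocks[:, :half].mean(axis=1)
        out.append(np.abs(fluct).mean())
    return np.array(out)

def estimate_H_haar(T, lags):
    """Fluctuation exponent H_h from the log-log slope of <|dT|> vs dt."""
    F = haar_fluctuations(T, lags)
    slope, _ = np.polyfit(np.log(lags), np.log(F), 1)
    return slope

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    lags = np.array([2, 4, 8, 16, 32, 64, 128])
    white = rng.standard_normal(1656)              # H = -1/2 reference case
    print(round(estimate_H_haar(white, lags), 2))  # close to -0.5, as expected
```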

Fig. 6

Haar fluctuation analysis for the Mean-G (bottom) and Mean-L (top) datasets before (red) and after (blue) removing the trends. The reference lines with slopes \(H_{h} = - \,0.064 \pm 0.020\) for the global series and \(H_{h} = - \,0.241 \pm 0.017\) for the land surface series were obtained from regression of the residuals between 2 months and 60 years. The last points were dropped to get better statistics. The units for \(\Delta t\) and \(\Delta T\) are months and °C, respectively

The global series are a combination of land surface data and sea surface temperature data. The average temperature over the ocean shows fluctuations that increase with time scale (positive \(H\)) up to about 2 years; this corresponds to the ocean weather regime discussed in Lovejoy and Schertzer (2013). The same break in the scaling is found in the global temperature fluctuations, but it is subtle, and an overall unique scaling regime can be assumed for the global data. The influence of the ocean on the global temperature also brings its fluctuation exponent towards higher values (closer to zero) compared to the land surface fluctuations. This makes the global data more predictable than the land-only series.

In the frequency domain, the corresponding spectra for the Mean-G dataset are shown in Fig. 7. The raw spectrum of the natural variability series is shown in grey. It exhibits scaling, but with large fluctuations, as expected. To get better estimates of the exponent we can average the raw spectrum using logarithmically spaced bins. These “cleaner” spectra for the series before and after removing the anthropogenic trend are shown in red and blue in the online version, respectively. Notice that they only differ appreciably in the low-frequency range, corresponding to the removed deterministic trend. The frequency, \(\omega\), is given in units of \(\left( {138\, {\text{years}}} \right)^{ - 1}\). The particularly low variability at frequencies corresponding to \(\left( { 3 0\, {\text{years}}} \right)^{ - 1}\) is an artefact of the 30-year detrending (baseline) period used in most of the datasets. The solid black line was obtained from a linear regression on the log-binned spectrum of the residuals. The exponent obtained from the absolute value of the slope was \(\beta = 0.81 \pm 0.13\). Using the monofractal relation \(\beta = 1 + 2H\), we obtain the estimate for the fluctuation exponent: \(H_{s} = - \,0.096 \pm 0.063\). The dashed reference line with slope corresponding to \(1 + 2H_{h} = 0.84 \pm 0.05\) was included in the figure for comparison (using the value obtained from the Haar fluctuation analysis in Fig. 6).
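A sketch of this log-binned spectral estimate (ours, assuming NumPy; `T` would be the detrended natural variability series):

```python
import numpy as np

def log_binned_spectrum(T, bins_per_decade=10):
    """Raw periodogram averaged in logarithmically spaced frequency bins."""
    T = np.asarray(T, dtype=float)
    spec = np.abs(np.fft.rfft(T - T.mean()))**2 / len(T)
    freq = np.arange(1, len(spec))               # units of (record length)^-1
    spec = spec[1:]
    edges = np.logspace(0.0, np.log10(freq[-1]), int(np.log10(freq[-1]) * bins_per_decade))
    which = np.digitize(freq, edges)
    w = np.array([freq[which == b].mean() for b in np.unique(which)])
    E = np.array([spec[which == b].mean() for b in np.unique(which)])
    return w, E

def estimate_beta(T):
    """Spectral exponent beta from E(w) ~ w^(-beta); H_s = (beta - 1)/2 follows from Eq. (9)."""
    w, E = log_binned_spectrum(T)
    slope, _ = np.polyfit(np.log(w), np.log(E), 1)
    return -slope

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    print(round(estimate_beta(rng.standard_normal(1656)), 2))   # near 0 for white noise
```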

Fig. 7

Spectra for the Mean-G dataset. In grey is the raw spectrum of the residuals. Averages over logarithmically spaced bins are shown for the series before (dashed, red) and after (blue) removing the trend. The solid black line, with slope \(- \beta\), was obtained from a linear regression on the log-binned spectrum of the residuals. The reference dashed line with absolute slope \(1 + 2H_{h} = 0.84\) was included for comparison (using the value obtained from the Haar fluctuation analysis in Fig. 6). The frequency, \(\omega\), is given in units of \(\left( {138\, {\text{years}}} \right)^{ - 1}\)

It is worth mentioning that this very simple approach to removing the warming trend is a special (low memory) case of the much more general linear response model with a scaling response function proposed by Hébert et al. (2019). In that work, the authors directly exploit the stochasticity of the internal variability and the linearity and scaling of the forced response to make projections based on historical data and a scaling step Climate Response Function that has a long memory. They include not only anthropogenic effects, but also solar and volcanic forcings. Consequently, the residuals they obtain once these forced components are removed do not represent the forced natural variability response, but the internal variability of the system. The authors based their analysis on the assumption that this internal stochastic component can be approximated by an fGn process; this hypothesis has been confirmed on the outputs of GCM preindustrial control runs, where the forcings are not present.

3.3 Fitting fGn to global data

Having obtained the stationary natural variability component, \(T_{\text{nat}}\), for the Mean-G dataset from the residuals of the linear regression of \(T\left( t \right)\) vs. \(\log_{2} \left[ {\rho_{{{\text{CO}}_{2} {\text{eq}}}} \left( t \right)/\rho_{{{\text{CO}}_{2} {\text{eq}},{\text{pre}}}} } \right]\) (Eqs. (26) and (27)), we can now model this series using the theory presented in Sect. 2 and Appendix 1. The first step is to obtain the parameters \(\mu\), \(\sigma_{T}^{2}\) and \(H\). We would like to underline that these parameters describe the—infinite ensemble—fGn stochastic process, but we can only obtain estimates for them based on a single realization (our globally-averaged temperature time series). In Appendix 1 we show how to obtain the MLE for \(\mu\) and \(\sigma_{T}^{2}\). In the case of the fluctuation exponent, we can repeat the methods presented in Sect. 3.2 and obtain estimates from the slopes in the Haar fluctuations and the spectrum curves. However, as we mentioned before, it is clear in Figs. 6 and 7 that the error in the estimates is much higher for these methods than by using the MLE or QMLE due to the high variability of the fluctuations. Nevertheless, their advantage over the latter is that they are general and apply not only to Gaussian processes (such as fGn), but also to multifractal or other intermittent processes with different statistics. The MLE and QMLE methods make the extra assumption of adequacy of the fGn model, which ultimately must be verified.

To get an idea of how well the stochastic model describes the observational dataset, we created completely synthetic time series by superimposing fGn simulations on the low-frequency anthropogenic trend. Four randomly chosen simulations are shown in Fig. 8 together with the Mean-G dataset (top). The synthetic series were created using \(\lambda_{{2 \times {\text{CO}}_{2} {\text{eq}}}} = 2.03\) °C and \(T_{0} = - \,0.379\) °C for the anthropogenic trend, \(T_{\text{anth}}\), and following the procedure described in Appendix 1-i with parameters \(\mu = 0\)  °C, \(\sigma_{T} = 0.195\) °C and \(H = - 0.060\) for simulating \(T_{\text{nat}}\) (see Eqs. (26) and (27)). All these parameters were obtained by fitting the Mean-G observations in the period 1880–2017 (\(N = 1656\) months). In Appendix 2 (Table 5), we summarize the parameters obtained for the ten datasets and the corresponding mean series for global and for land.

Fig. 8
figure 8

Four randomly chosen synthetic time series together with the Mean-G dataset (top). The simulations were created by superimposing fGn simulations for \(T_{\text{nat}}\) to the low-frequency anthropogenic trend, \(T_{\text{anth}}\) (see Appendix 1 and Eqs. (26) and (27)). The parameters used for the simulation (shown in the figure) were obtained by fitting the Mean-G series in the period 1880–2017

Although a visual inspection of Fig. 8 is not a convincing proof of the applicability of the model, it is clear that if we eyeball the completely synthetic time series with the observational Mean-G dataset, you cannot tell which is which. A simple verification of the fGn behavior of the detrended data can be done by checking that the biased temporal estimate of the variance, \(SD_{T}^{2}\), and the value obtained using maximum likelihood, \(\hat{\sigma }_{T}^{2}\), satisfy Eq. (25) (derived in Appendix 1-iii.).

Following Eq. (25), the temporal estimate of the variance should depend on the number of months, \(n\), that is used for the estimates: \(SD_{T}^{2} \left( n \right) = \sigma_{T}^{2} \left( {1 - n^{2H} } \right)\). For only one time series, the estimate of \(SD_{T}^{2} \left( n \right)\) is noisy. To reduce the noise, this value can be estimated using k-segments of the series from \(t = k\) to \(t = k + n - 1\) (each of length \(n\)), and then averaged over the total ensemble of segments (in this case \(N_{\text{segments}} = N - n_{ \text{max} }\), where \(N = 1656\) months is the full length of the series and \(n_{ \text{max} } = 120\) months is the maximum length of the segments used):

$$\left\langle {SD_{T}^{2} \left( n \right)} \right\rangle = \frac{n - 1}{n}SD_{T}^{2} \left( n \right) = \frac{1}{{N - n_{\max } }}\sum\limits_{k = 1}^{{N - n_{\max} }} {\left[ {\frac{1}{n}\sum\limits_{t = k}^{k + n - 1} {\left( {T_{t} - \bar{T}_{n} } \right)^{2} } } \right]} ,$$
(28)

where \(\bar{T}_{n} = \mathop \sum \nolimits_{t = 1}^{n} T_{t} /n\), the values \(T_{t}\) are for the natural variability component of the Mean-G dataset and the factor \(\left( {n - 1} \right)/n\) accounts for the bias of the length-\(n\) sample estimate, \(SD_{T}^{2} \left( n \right)\), with respect to the length-\(n\) population variance, \(\left\langle SD_{T}^{2} \left( n \right) \right\rangle\).

In Fig. 9 we show in red line with circles the empirical values of the standard deviation \(\left\langle SD_{T}^{2} \left( n \right) \right\rangle^{1/2}\) as a function of \(n\) (obtained using Eq. (28) for the ensemble of \(N - n_{ \text{max} }\) segments). The function \(f_{{\sigma_{T} ,H}} \left( n \right) = \sigma_{T} \sqrt {\left( {1 - n^{2H} } \right)\left( {1 - n^{ - 1} } \right)}\) (obtained by replacing the expression for \(SD_{T}^{2} \left( n \right)\) in Eq. (28) and taking the square root) is plotted using \(\sigma_{T} = \hat{\sigma }_{T} = 0.195\) °C and the following values of \(H\): \(H_{f} = - 0.069\) (solid black line), obtained from the fit of the red curve; \(H_{l} = - 0.060\) (dashed line), obtained using MLE, and \(H_{q} = - 0.080\) (dotted line), from the QMLE. The empirical curve for a synthetic realization of Gaussian white noise with standard deviation \(\sigma_{\text{wn}} = 0.141\) °C was also included for comparison (blue line with squares).

Fig. 9
figure 9

Empirical values of \(\left\langle SD_{T}^{2} \left( n \right) \right\rangle^{1/2}\) as a function of \(n\), obtained using Eq. (28) (red line with circles). The function \(f_{{\sigma_{T} ,H}} \left( n \right) = \sigma_{T} \sqrt {\left( {1 - n^{2H} } \right)\left( {1 - n^{ - 1} } \right)}\), with \(\sigma_{T} = \hat{\sigma }_{T} = 0.195\) °C, is plotted for three values of \(H\): \(H_{f} = - 0.069\) (solid black line), obtained from the fit of the red curve; \(H_{l} = - 0.060\) (dashed line), obtained using MLE and \(H_{q} = - 0.080\) (dotted line), from QMLE. The empirical curve for a synthetic realization of Gaussian white noise with variance \(\sigma_{\text{wn}}^{2} = 0.02\) °C was also included for comparison (blue line with squares). The agreement between the red line with circles and the solid black line is an evidence of the fGn behavior of the natural variability

The difference between the red curve for the observational time series and the blue curve for the uncorrelated synthetic series illustrates the effects of the long-range correlations in the natural variability of the globally-averaged temperature time series. This strong dependence of the estimates of the variance with the length of the estimation period for \(H\) close to zero could have an influence on statistical methods that depend on the covariance matrix [e.g. empirical orthogonal function (EOF) and empirical mode decompositions (EMD)].

The agreement between the \(\left\langle SD_{T}^{2} \left( n \right) \right\rangle^{1/2}\) curve estimated from the data and the function \(f_{{\sigma_{T} ,H}} \left( n \right)\)—that only depends on the two parameters \(\sigma_{T}\) and \(H\)—is an evidence of the good fit of the fGn stochastic model to the natural variability. At the same time, it could be used as an alternative method for obtaining the parameters \(\sigma_{T}\) and \(H\) by fitting the curve \(\left\langle SD_{T}^{2} \left( n \right) \right\rangle^{1/2}\) based on observations using the function \(f_{{\sigma_{T} ,H}} \left( n \right)\).

More detailed statistical tests to check the fit of the model to the data are shown in Appendix 2 using the theory presented at the end of Appendix 1. The main conclusion is that the global average temperature series can be considered Gaussian as well as their innovations, while for the case of land average temperature, there are some deviations from Gaussianity. Nevertheless, the residual autocorrelation functions (RACF) satisfy the normality condition with good enough accuracy for all datasets, corroborating the whiteness of the innovations and hence that an fGn model can be considered a good approximation in all cases.

3.4 Forecast and validation

3.4.1 The low-frequency anthropogenic component

Ultimately, as a final step to confirm the adequacy of the model to simulating and forecasting global temperature data, we present the skill scores obtained from hindcast verifications and compare their values with the theoretical predictions. First, we should point out that for predicting the global temperature we need to forecast both the anthropogenic component and the natural variability. Our final estimator for \(k\) steps into the future, following Eq. (26), is given by:

$$\hat{T}\left( {t + k} \right) = \hat{T}_{\text{anth}} \left( {t + k} \right) + \hat{T}_{\text{nat}} \left( {t + k} \right),$$
(29)

where \(\hat{T}_{\text{nat}}\) is obtained from Eq. (22) using the theory presented in Sect. 2.2.1. The anthropogenic component, which we model with a separate low-frequency process must also be forecast. Nevertheless, even if we use persistence of the CO2eq increments, the error on predicting the low-frequency component is small compared to the error on forecasting the natural variability (for lead times up to a year or so). For this reason, for obtaining \(\hat{T}_{\text{anth}} \left( {t + k} \right)\) based on the previous values of the trend, we just assume persistence of the increments \(\Delta T_{\text{anth}} \left( {t,k} \right) = T_{\text{anth}} \left( t \right) - T_{\text{anth}} \left( {t - k} \right)\), that is:

$$\begin{aligned} \hat{T}_{\text{anth}} \left( {t + k} \right) &= T_{\text{anth}} \left( k \right) + \Delta T_{\text{anth}} \left( {t,k} \right) \hfill \\ \hat{T}_{\text{anth}} \left( {t + k} \right) &= 2T_{\text{anth}} \left( t \right) - T_{\text{anth}} \left( {t - k} \right). \hfill \\ \end{aligned}$$
(30)

For a linear trend, the absolute error \(\left\langle {\left| {T_{\text{anth}} \left( {t + k} \right) - \hat{T}_{\text{anth}} \left( {t + k} \right)} \right|} \right\rangle = \left\langle {\left| {\Delta T_{\text{anth}} \left( {t + k,k} \right) - \Delta T_{\text{anth}} \left( {t,k} \right)} \right|} \right\rangle = 0\). In the case of the CO2eq trend shown in black in Fig. 5, for small \(k\), the function is almost linear in a \(k\)-vecinity of any \(t\). This justifies the rejection of this error compared to the error on forecasting the natural variability. For reference, the root mean square error (\({\text{RMSE}}\)) using this method for the anthropogenic component, in the 1044-months hindcast period January 1931–December 2017, performed with \(k = 24\) months in advance for every month, was of 0.01 °C for all global datasets.

3.4.2 The natural variability component

For the natural variability, the expectation of the \({\text{RMSE}}\)—taking the infinite ensemble average using the theory for fGn—for a prediction \(k\) steps into the future is defined by:

$${\text{RMSE}}_{\text{nat}}^{\text{theory}} \left( k \right) = \sqrt {\left\langle {\left[ {T_{\text{nat}} \left( {t + k} \right) - \hat{T}_{\text{nat}} \left( {t + k} \right)} \right]^{2} } \right\rangle } .$$
(31)

According to the definition of \({\text{MSSS}}\), given by Eq. (12), and the analytical expression, Eq. (24), a theoretical ensemble estimate of \({\text{RMSE}}_{\text{nat}} \left( k \right)\), for prediction using a memory of \(m\) steps, is given by:

$${\text{RMSE}}_{\text{nat}}^{\text{theory}} \left( k \right){\text{ = RMSE}}_{{H,\sigma_{T} }}^{m} \left( k \right) = \sigma_{T} \sqrt {1 - {\tilde{\mathbf{C}}}_{H}^{m} \left( k \right)^{T} \left( {{\tilde{\mathbf{R}}}_{H}^{m} } \right)^{ - 1} {\tilde{\mathbf{C}}}_{H}^{m} \left( k \right)} .$$
(32)

Notice that, unlike the \({\text{MSSS}}\), this is not only a function of the horizon, \(k\), the memory, \(m\), and the exponent, \(H\), but also of the specific series we are forecasting due to the presence of the parameter \(\sigma_{T}\), which must be estimated using Eq. (50) in Appendix 1. As expected, for given values of \(k\), \(m\) and \(H\), the \({\text{RMSE}}\) is proportional to the amplitude of the series we want to predict.

3.4.3 Validation

To validate our model, we produced series of hindcasts at monthly resolution, each for a different horizon from 1 to 12 months, in the verification period January 1931–December 2017. For this hindcast series each subsequent point plotted on the graph was independently predicted using the information available \(k\) months before. What changes from month to month is the initialization date while the forecast horizon is kept fixed. Such hindcast series are useful because they show how close the predictions are to the observations for a given value of \(k\). The dependence with the horizon of many scores (e.g., the \({\text{RMSE}}\)), are obtained from the difference between hindcasts series at a fixed \(k\) and the corresponding series of observations.

StocSIPS assumes an additive fixed annual cycle independent of the low-frequency trend; it does not make distinctions from month to month from the point of view of the statistics of the anomalies. In fact, is this month-to-month correlation that is exploited as a source of predictability in the stochastic model. Nevertheless, there is always an intrinsic multiplicative seasonality in the data that is impossible to completely remove without affecting the scaling behavior of the spectrum. To account for the effects of this seasonality, we can stratify the observations and the forecasts series to show dependences with the initialization date.

For each horizon, \(k\), we used a memory \(m = 20k\). For example, to predict the average temperature for January 1931 with \(k = 1\) month, we used the previous 21 months, including December 1930, and the same was done for each month up to December 2017. For \(k = 2\) months, we used the previous 41 months, including December 1930, to produce the first forecast for February 1931, and so on.

Examples of the hindcasts series initialized every month, each for a different horizon, are shown in Fig. 10 for the Mean-G natural variability. In blue, we show the hindcasts series for \(k =\) 1, 3 and 6 months (bottom to top). In red we show the verification curve of observations for the natural variability starting in January 1931. The vertical gridlines correspond to the forecast and verification for each January; that is, initializing the first day of each January with data up to every December in the bottom panel, up to every October in the middle panel and up to every July in the top one. This shows how the stratification is done for obtaining dependences of the skill with the initialization date (shown later).

Fig. 10
figure 10

In blue, series of hindcasts for the Mean-G natural variability initialized every month for horizons \(k =\) 1, 3 and 6 months (bottom to top). In red, the verification curve of observations for the natural variability starting in January 1931. The vertical gridlines correspond to the forecast and verification for each January; that is, initializing with data up to every December in the bottom panel, every October in the middle and every July in the top

As can be seen in Fig. 10, there is a reduction of the amplitude and an increasing lag between the observed and forecast time series as the horizon increases (more noticeable in the top panel). This is due to the model tendency to predict the return rate towards the mean as a function of \(H\). Extremes can therefore only be predicted as a consequence of the anthropogenic increase. However, the general behavior of the temperature is well predicted.

Equation (31) is the definition of the infinite ensemble expectation of the \({\text{RMSE}}\), for which we get an analytical expression (Eq. (32)). The all-months verification \({\text{RMSE}}\) can then be computed from the series shown in Fig. 10 as:

$${\text{RMSE}}_{\text{nat}} \left( k \right) = \sqrt {\frac{1}{N - k + 1}\sum\limits_{t = 0}^{N - k + 1} {\left( {T_{\text{nat}} \left( {t + k} \right) - \hat{T}_{\text{nat}} \left( {t + k} \right)} \right)^{2} } }$$
(33)

where \(N = 1044\) months (from January 1931 to December 2017) and the number of terms in the sum is reduced in \(k - 1\) because the last verification date (December 2017) is the same for every \(k\) while the first verification date is \(k\) months after December 1931 (\(t = 0\)) for each horizon. This equation can be adapted to get the \({\text{RMSE}}\) for each horizon and for each initialization month.

In Fig. 11a, we show a comparison between the \({\text{RMSE}}\) obtained from the hindcasts of all the months in the verification period 1931–2017 using Eq. (33) and the theoretical expected \({\text{RMSE}}\), which is only a function of \(\hat{\sigma }_{T}\), \(H\) and \(m\) (Eq. (32)). The agreement between the theory (solid black) and the actual errors (red curve) is another confirmation of the model for the simulation and prediction of global temperature. In the figure, we also included the values \(\hat{\sigma }_{T} = 0.195\) °C and \(SD_{T} = 0.147\) °C for the Mean-G natural variability (dotted and dashed lines respectively). The value of the former is the same as shown in Table 5, while the value of the latter is slightly different from the value reported there because now it was computed for the verification series in the period 1931–2017 (red curve in Fig. 10). Notice that, for \(N = 1044\) months and \(H = - 0.060\) (see Table 5), \(SD_{T} /\sqrt {1 - N^{2H} } = 0.195\) °C, in perfect agreement with the value of \(\hat{\sigma }_{T}\) for that dataset.

Fig. 11
figure 11

\({\text{RMSE}}\) of StocSIPS forecasts for the Mean-G dataset. a Curves of \({\text{RMSE}}_{\text{nat}} \left( k \right)\) (red circles) and \({\text{RMSE}}_{\text{raw}} \left( k \right)\) (blue squares), for the natural variability component and for the raw series, respectively. The curves were obtained using Eq. (33) from the hindcasts of the Mean-G dataset including all the months in the verification period 1931–2017. The difference between the two is negligible. The theoretical expected \({\text{RMSE}}_{\text{nat}}^{\text{theory}} \left( k \right)\) (solid black), given by Eq. (32), is also shown for comparison. The values of \(\hat{\sigma }_{T}\) (Table 5) and \(SD_{T}\) for the Mean-G natural variability were included for reference (dotted and dashed lines, respectively). b Density plot with the \({\text{RMSE}}\) as a function of the forecast horizon and the initialization month. The diagonal pattern from the top-left corner to the bottom-right is an indication of the intrinsic multiplicative seasonality in the time-series. c Graphs of \({\text{RMSE}}\) vs. initialization month for different forecast horizons (\(k =\) 1, 3, 6 and 12 months). There is an increase in the \({\text{RMSE}}\) for the forecast of the Boreal winter months associated to the increase in the standard deviation, \(SD_{T}\), of the globally-averaged temperature for those months (shown in dashed black line in the bottom panels figures). d Graphs of \({\text{RMSE}}\) vs. \(k\) for different initialization months. For large values of \(k\) the skill of the model is small and the value of the \({\text{RMSE}}\) is close to the standard deviation for that specific month (dashed black line). The \({\text{RMSE}}\) graph in a is close to the average of the \({\text{RMSE}}\) graphs in d

The error for the anthropogenic trend forecast calculated using Eq. (30) is always less than 7% of the \({\text{RMSE}}_{\text{nat}}\) shown in Fig. 11a (see the final paragraph of Sect. 3.4.1). Because of this, its contribution to the overall error, \({\text{RMSE}}_{\text{raw}}\), on forecasting the raw temperature (natural plus anthropogenic) is lower than 0.4% for all horizons (compare the red-circles and the blue-squares curves in Fig. 11a). For all practical purposes, \({\text{RMSE}}_{\text{raw}} \approx {\text{RMSE}}_{\text{nat}}\) with a high degree of accuracy.

In Fig. 11b, we show a density plot with the \({\text{RMSE}}\) as a function of the forecast horizon and the initialization month. The diagonal pattern from the top-left corner to the bottom-right is an indication of the intrinsic seasonality in the time-series. This is shown in detail in the bottom panels figures.

In Fig. 11c, we show graphs of \({\text{RMSE}}\) vs. initialization month for different forecast horizons (\(k =\) 1, 3, 6 and 12 months). There is an increase in the \({\text{RMSE}}\) for the forecast of the Boreal winter months associated to the increase in the variability (standard deviation, \(SD_{T}\)) of the globally-averaged temperature for those months (shown in dashed black line in the bottom panels figures). In Fig. 11d, we show graphs of \({\text{RMSE}}\) vs. \(k\) for different initialization months. As expected, there is an increase in the \({\text{RMSE}}\) with \(k\). For large values of \(k\) the skill of the model is small and the value of the \({\text{RMSE}}\) is close to the standard deviation for that specific month (dashed black line). The \({\text{RMSE}}\) graph in panel (a) is close to the average of the \({\text{RMSE}}\) graphs in panel (d). It is actually the all-month \({\text{MSE}}\) the one that is the average of the \({\text{MSE}}\)s for each month (as long as the number of years used for the average is the same for every month).

Related to the \({\text{RMSE}}\) score, the mean square skill score (\({\text{MSSS}}\)) is a commonly used metric:

$${\text{MSSS}} = 1 - \frac{\text{MSE}}{{{\text{MSE}}_{\text{ref}} }},$$
(34)

where \({\text{MSE}} = {\text{RMSE}}^{2}\) is computed using Eq. (33) and \({\text{MSE}}_{\text{ref}}\) is the mean square error of some reference forecast.

The climatology—constant annual cycle taken from the average in a given reference period of at least 30 years—is commonly used as reference forecast. In this case, \({\text{MSE}}_{\text{ref}} = SD_{\text{raw}}^{2}\), is the variance of the raw series:

$$SD_{\text{raw}}^{2} = \overline{{\left( {T_{\text{anth}} + T_{\text{nat}} } \right)^{2} }} = \overline{{T_{\text{anth}}^{2} }} + SD_{T}^{2}$$
(35)

(assuming that the natural and anthropogenic variabilities are independent) and we call \({\text{MSSS}} = {\text{MSSS}}_{\text{raw}}\).

If we take as reference the anthropogenic trend forecast, then \({\text{MSE}}_{\text{ref}} = SD_{T}^{2}\), is the variance of the natural variability component (detrended series, \(T_{\text{nat}}\)) and we name \({\text{MSSS}} = {\text{MSSS}}_{\text{nat}}\). This would be the same as the skill on forecasting the detrended series taking as reference forecast its mean value. Using the theoretical expressions for \(SD_{T}^{2}\) and for \({\text{RMSE}} = {\text{RMSE}}_{\text{nat}}^{\text{theory}} \left( k \right)\) (Eqs. (25) and (32), respectively) we can obtain an analytical expression for \({\text{MSSS}}_{\text{nat}}\):

$${\text{MSSS}}_{\text{nat}}^{\text{theory}} \left( k \right) = \frac{{{\text{MSSS}}_{H}^{m} \left( k \right) - N^{2H} }}{{1 - N^{2H} }},$$
(36)

where \({\text{MSSS}}_{H}^{m} \left( k \right)\) was defined for the infinite ensemble average in Eq. (24) [Eq. (13) for the continuous-time case]. Notice that \({\text{MSSS}}_{\text{nat}}^{\text{theory}} \left( k \right)\) is not only a function of the fluctuation exponent, \(H\), and the memory used for the forecasts, \(m\), but also of the length of the verification period, \(N\). For an infinite series, the ergodicity of the system is verified; i.e. the temporal average is equal to the ensemble average: \({\text{MSSS}}_{\text{nat}}^{\text{theory}} \left( k \right) = {\text{MSSS}}_{H}^{m} \left( k \right)\) (recall \(H < 0\)). We can check the agreement between the theoretical result (Eq. (36)) and the \({\text{MSSS}}_{\text{nat}}\) obtained from hindcast to verify the validity of the model.

The anomaly correlation coefficient (\({\text{ACC}}\)) is another commonly used verification score. In this case, we can also obtain the \({\text{ACC}}\) for the raw or for the detrended series:

$${\text{ACC}}_{{{{\text{nat}} \mathord{\left/ {\vphantom {{\text{nat}} {\text{raw}}}} \right. \kern-0pt} {\text{raw}}}}} \left( k \right) \;=\; \frac{{\overline{{T_{{{{\text{nat}} \mathord{\left/ {\vphantom {{\text{nat}} {\text{raw}}}} \right. \kern-0pt} {\text{raw}}}}} \left( {t + k} \right)\hat{T}_{{{{\text{nat}} \mathord{\left/ {\vphantom {{\text{nat}} {\text{raw}}}} \right. \kern-0pt} {\text{raw}}}}} \left( {t + k} \right)}} }}{{SD_{{{T \mathord{\left/ {\vphantom {T {\text{raw}}}} \right. \kern-0pt} {\text{raw}}}}} \sqrt {\overline{{\hat{T}_{{{{\text{nat}} \mathord{\left/ {\vphantom {{\text{nat}} {\text{raw}}}} \right. \kern-0pt} {\text{raw}}}}} \left( t \right)^{2} }} } }},$$
(37)

where we assume that \(T\left( t \right)\) and the predictor \(\hat{T}\left( t \right)\) are zero mean anomalies, the overbars indicate temporal average for a constant forecast horizon, \(k\), and either all the subscripts are “nat” or all are “raw” depending on whether we forecast the detrended or the raw anomalies, respectively. In the latter case, spurious high values of the \({\text{ACC}}\) (similarly for the \({\text{MSSS}}\)) are found due to the presence of the deterministic trend. This is a very common flaw found throughout the literature, where this score is routinely reported for undetrended anomalies.

It is useful to note the relationship between the \({\text{ACC}}\) and \({\text{MSSS}}\) obtained from minimum mean square predictions. It can be easily seen from the orthogonality principle, \(\left\langle {\hat{T}\left( {T - \hat{T}} \right)} \right\rangle = 0\), that the stochastic predictions satisfy

$${\text{ACC}}_{\text{nat}} \left( k \right) = \sqrt {{\text{MSSS}}_{\text{nat}} \left( k \right)}$$
(38)

for any horizon \(k\). This relation can also be used to check the agreement between the theoretical predictions of the model and the actual results obtained from hindcasts verification.

In Fig. 12 we summarize the results for the \({\text{MSSS}}\) (top) and the \({\text{ACC}}\) (bottom). In Fig. 12a, we show curves of \({\text{MSSS}}\) vs. \(k\) for the Mean-G dataset considering all months in the verification period 1931–2017. In red line with circles, the curve for \({\text{MSSS}}_{\text{nat}}\) taking as reference the anthropogenic trend forecast, for which \({\text{MSE}}_{\text{ref}} = SD_{T}^{2}\) (\(SD_{T} = 0.147\) °C). In green line with triangles, the values for \({\text{MSSS}}_{\text{raw}}\) taking as reference the climatology forecast with \({\text{MSE}}_{\text{ref}} = SD_{\text{raw}}^{2}\) (\(SD_{\text{raw}} = 0.293\) °C). The theoretical expected \({\text{MSSS}}_{\text{nat}}^{\text{theory}} \left( k \right)\) (solid black), given by Eq. (36), is also shown for comparison. There is relatively good agreement between this theoretical prediction of the model and the \({\text{MSSS}}\) values obtained from the verification. The asymptotic value of \({\text{MSSS}}_{\text{nat}}^{\text{theory}} \left( k \right)\) for \(N \to \infty\) (given by Eq. (24)) is shown in dotted line with squares (dashed line for the continuous-time case, Eq. (13)). The longer the verification period the closer the \({\text{MSSS}}\) will be to that asymptotic value. For the discrete theoretical curves (solid black line and dotted black with squares), we used a memory \(m = 20k\). The small difference for \(k = 1\) month, between this curve and the one for the continuous case (solid black) is due to the high-frequency information loss in the discretization process.

Fig. 12
figure 12

\({\text{MSSS}}\) and \({\text{ACC}}\) of StocSIPS forecasts for the Mean-G dataset. a Curves of \({\text{MSSS}}\) vs. \(k\) for the Mean-G dataset considering all months in the verification period 1931–2017. In red line with circles, the curve for \({\text{MSSS}}_{\text{nat}}\) taking as reference the anthropogenic trend forecast. In green line with triangles, the values for \({\text{MSSS}}_{\text{raw}}\) taking as reference the climatology forecast. The theoretical expected \({\text{MSSS}}_{\text{nat}}^{\text{theory}} \left( k \right)\) (solid black), given by Eq. (36), is also shown for comparison. The asymptotic value for \(N \to \infty\) (given by Eq. (24)) is shown in dotted line with squares (dashed line for the continuous-time case, Eq. (13)). The longer the verification period the closer will be the \({\text{MSSS}}\) to that asymptotic value. b Density plot showing the \({\text{MSSS}}\) as a function of the forecast horizon and the initialization month. c Curves of \({\text{ACC}}_{\text{nat}}\) (red circles) and \({\text{ACC}}_{\text{raw}}\) (green triangles) as a function of the forecast horizon obtained from Eq. (37). The values of \(\sqrt {{\text{MSSS}}_{\text{nat}} }\) (blue squares) were included to check the consistency of the theoretical relationship given by Eq. (38). d Density plot of the \({\text{ACC}}\) as a function of the forecast horizon and the initialization month. The diagonal patterns from the top-left corner to the bottom-right in b, d are consequences of the intrinsic seasonality in the time-series

In Fig. 12c, we show curves of \({\text{ACC}}_{\text{nat}}\) (red circles) and \({\text{ACC}}_{\text{raw}}\) (green triangles) obtained from Eq. (37). Here, we can appreciate the spuriously high correlation values of \({\text{ACC}}_{\text{raw}}\) compared to the \({\text{ACC}}_{\text{nat}}\) due to the presence of the anthropogenic trend. The values of \(\sqrt {{\text{MSSS}}_{\text{nat}} }\) (blue squares) were included to check the consistency of the theoretical relationship given by Eq. (38); we see that it is relatively well satisfied, confirming the validity of the model.

In the right panels of Fig. 12, we show density plots with the \({\text{MSSS}}\) and the \({\text{ACC}}\) [panels (b) and (d), respectively] as a function of the forecast horizon and the initialization month. As we already showed for the \({\text{RMSE}}\), there are diagonal patterns from the top-left corner to the bottom-right as a consequence of the seasonality in the globally-averaged temperature anomalies. Nevertheless, for the \({\text{MSSS}}\) and the \({\text{ACC}}\), these patterns are relatively less significative compared to the ones in the \({\text{RMSE}}\) because—roughly speaking—both scores are functions of the ratio \({\text{RMSE}}_{\text{nat}} /SD_{T}\), reducing the impact of the variation of the standard deviation of each individual month (see Fig. 11c). Some results of the hindcast validation are summarized in Table 7 for the twelve datasets, including the mean series for the global and the land surface.

3.4.4 Parametric probability forecast

Probability forecasts from long-term prediction dynamical models are usually obtained by fitting probability distributions to the ensemble forecast for each month and deriving probabilities of three climatologically equiprobable categories: below normal, near normal and above normal conditions. In general, the form of the distribution and the skill of the forecast is affected by the size of the ensemble. One of the main advantages of StocSIPS over conventional numerical models is that, by its inherent stochastic nature, the infinite ensemble parametric probability forecast can be obtained analytically without the need of simulating any individual realization. Following the results presented in Sect. 2, the theoretical probability distribution forecast at horizon \(k\), taking data up to time \(t\), is a Gaussian with mean \(\mu_{f} = \hat{T}\left( {t + k} \right)\) given by Eq. (29) and standard deviation \(\sigma_{f} \left( k \right) = {\text{RMSE}}_{{H,\sigma_{\text{T}} }}^{m} \left( k \right)\) given by Eq. (32) (we neglected the error in the projection of the anthropogenic trend). In this section we only consider results for the full time series without stratification of the data. The theoretical expression for \(\sigma_{f} \left( k \right)\), obtained from the results for an infinite ensemble, only applies in this case.

The “reliability” is defined as the consistency or repeatability of the probabilistic forecast. In order to evaluate the reliability of the probabilistic forecast of an ensemble model, the ensemble spread score (\({\text{ESS}}\)) is commonly used as a summarizing metric. The ensemble spread score (\({\text{ESS}}\)) is defined as the ratio between the temporal mean of the intra-ensemble variance, \(\overline{{\sigma_{\text{ensemble}}^{2} }}\), and the mean square error between the ensemble mean and the observations (Palmer et al. 2006; Keller and Hense 2011; Pasternack et al. 2018):

$${\text{ESS}} = \frac{{\overline{{\sigma_{\text{ensemble}}^{2} }} }}{\text{MSE}}.$$
(39)

In the case of StocSIPS, \(\overline{{\sigma_{\text{ensemble}}^{2} }} = \sigma_{f}^{2}\) is obtained analytically using Eq. (32) and \({\text{MSE}} = {\text{RMSE}}^{2}\) is obtained from the hindcasts using Eq. (33).

Following Palmer et al. (2006), an \({\text{ESS}}\) of 1 indicates perfect reliability. The forecast is “overconfident” when \({\text{ESS}} <\) 1; i.e. the ensemble spread underestimates forecast error. If the ensemble spread is greater than the model error (\({\text{ESS}} >\) 1), the forecast is “overdispersive” and the forecast spread overestimates forecast error. In Fig. 11a, we showed that there is good agreement between the theoretical estimate \({\text{RMSE}}_{{H,\sigma_{\text{T}} }}^{m} \left( k \right) = \sigma_{f} \left( k \right)\) and the hindcast error \({\text{RMSE}}_{\text{nat}} \left( k \right)\) for all horizons \(k\), or—what is the same—between \(\overline{{\sigma_{\text{ensemble}}^{2} }}\) and \({\text{MSE}}\) in Eq. (39). This gives a value of \({\text{ESS}} \approx 1\), so that StocSIPS is a nearly perfectly reliable system without needing a recalibration of the forecast probability distribution.

Examples of probability forecasts for July 1984 for the natural variability component of the Mean-G dataset are shown in Fig. 13 for horizons \(k = 1\) and 3 months (left and right panels, respectively). That is, using data up to June 1984 for the \(k = 1\) month forecast and up to April 1984 for \(k = 3\) months. The normal probability density function (PDF) in grey represents the climatological distribution of the monthly temperatures for the detrended anomalies of the Mean-G dataset for the full period 1931–2017, for which \(\sigma_{\text{clim}} = SD_{T} = 0.147\) °C. The terciles of the climatological distribution are indicated by vertical dashed lines. These vertical lines define three equiprobable categories of above normal, near normal, and below normal monthly temperatures observed in the verification period. The forecast distribution is indicated by the black curve with the forecast mean \(\mu_{f} = \hat{T}\left( {{\text{Jul }}1984} \right) = - \,0.118\) °C and standard deviation \(\sigma_{f} = {\text{RMSE}}_{{H,\sigma_{\text{T}} }}^{m} \left( k \right) = 0.101\) °C for \(k = 1\) month (left panel) and \(\mu_{f} = - \,0.063\) °C, \(\sigma_{f} = 0.122\) °C for \(k = 3\) months (right panel). The areas under the forecast PDF in different colors indicate probabilities of below normal (blue), near normal (yellow), and above normal (pink) temperatures. These probabilities are summarized in the top-left corner as bar plots. The climatological probability of 33% is indicated by the horizontal dashed line. The observed temperature for that specific date, \(T_{\text{obs}} = -\, 0.191\) °C, is represented by the vertical green line. The forecast distributions for \(k = 1\) month are sharper than for \(k = 3\) months. As expected, the confidence of the probabilistic forecast decreases as the lead time increases and they become more conservative.

Fig. 13
figure 13

Example of parametric probability forecasts for July 1984 for the natural variability component of the Mean-G dataset for horizons \(k = 1\) and 3 months (left and right panels, respectively). That is, using data up to June 1984 for the \(k = 1\) month forecast and up to April 1984 for \(k = 3\) months. The normal probability density function in grey represents the climatological distribution of the monthly temperatures for the detrended anomalies of the Mean-G dataset for the full period 1931–2017. The terciles of the climatological distribution are indicated by vertical dashed lines. The colored areas under the forecast density function are proportional to the forecast probabilities for each category: below normal (blue), near normal (yellow) and above normal (pink). These probabilities are summarized in the top-left corner as bar plots. The climatological probability of 33% is indicated by the horizontal dashed line. The observed temperature for that specific date, \(T_{\text{obs}} = - 0.198\) °C, is represented by the vertical green line. The parameters for all the distributions are included in the legends

The verification of the probabilistic forecast in categories (above, near and below normal) is done using 3 \(\times\) 3 contingency tables (Stanski et al. 1989). The forecast and observed categories are simply classified in a table of three rows and three columns. There is a row for each observed category and a column for each forecast category. For each month forecast, one is added to the grid element of the contingency table according to the intersection of the forecast category and the observed category. In Table 1 we show the contingency table for the \(k = 1\) month forecast of the natural variability anomalies, \(T_{\text{nat}}\), of the Mean-G dataset (red curves in Fig. 10). The 1044 month period (Jan 1931–Dec 2017) was used for verification. The climatological distribution was defined using the mean and standard deviation of the detrended series over that period.

There are many scores that can be obtained from the contingency table (Stanski et al. 1989). In this paper we used the percent correct (\({\text{PC}}\)) obtained from the elements in the main diagonal (shown in bold in Table 3). This score, often called accuracy, is very intuitive and it counts, overall, the percentage of the category forecasts that were correct. From Table 1, we obtain the values \({\text{PC}}_{\text{nat}} = 100 \,\left( {272 + 160 + 250} \right)/1044 \approx 65\%\). We can obtain contingency tables for all \(k\). The dependence of the \({\text{PC}}\) with \(k\), is shown in Fig. 14 for the forecasts of the detrended anomalies, \(T_{\text{nat}}\) (blue line with squares in the figure). The dashed line at 33.3% is a reference showing the skill of the climatological forecast.

Table 1 Contingency table for the \(k = 1\) month forecast of the natural variability anomalies, \(T_{\text{nat}}\), of the Mean-G dataset (red curves in Fig. 10)
Fig. 14
figure 14

\({\text{PC}}\) as a function of \(k\) for the forecasts of the detrended anomalies, \(T_{\text{nat}}\) (blue line with squares in the figure). The dashed line at 33.3% is a reference showing the skill of the climatological forecast

The thresholds for the three equiprobable categories, above normal, near normal and below normal, will depend on the base-line of zero temperature and the standard deviation of the reference climatological distribution used. This will affect the distribution of events in the contingency table and consequently, the \({\text{PC}}\) score obtained even though the forecast system has not changed. In that sense, the \({\text{PC}}\) is a relative score. To avoid this dependence we could use absolute scores (independent of the climatology used), such as the ignorance score or the continuous ranked probability score (\({\text{CRPS}}\)) (Hersbach 2000; Gneiting et al. 2005). The latter is the one we used in this paper for evaluating the quality of the probability forecasts of StocSIPS.

The \({\text{CRPS}}\) for a forecast initialized at time \(t\) with horizon \(k\) is defined as:

$${\text{crps}}\left( {t + k} \right) = \int\limits_{ - \infty }^{\infty } {\left[ {P_{f} \left( {t + k,x} \right) - P_{o} \left( {t + k,x} \right)} \right]^{2} dx} ,$$
(40)

where \(P_{f} \left( {t,x} \right)\) is the cumulative forecast distribution with mean \(\mu_{f} = \hat{T}\left( {t + k} \right)\) given by Eq. (29) and standard deviation \(\sigma_{f} \left( k \right) = {\text{RMSE}}_{{H,\sigma_{\text{T}} }}^{m} \left( k \right)\) and \(P_{o} \left( {t + k,x} \right) = H\left[ {x - T_{\text{obs}} \left( {t + k} \right)} \right]\) is the cumulative observed distribution defined in terms of the Heaviside function \(H\left( x \right)\). The \({\text{CRPS}}\) can be determined for a single forecast, but a more accurate value is determined from a temporal average of many forecasts. The time mean \({\text{CRPS}}\) as a function of horizon \(k\) is:

$${\text{CRPS}}\left( k \right) = \frac{1}{N - k + 1}\sum\limits_{t = 0}^{N - k} {{\text{crps}}\left( {t + k} \right)} .$$
(41)

The \({\text{CRPS}}\) is a negatively oriented measure of forecast accuracy, similar to the \({\text{RMSE}}\) for deterministic ensemble mean forecasts; that is, smaller values indicate better skill. In fact, for deterministic forecasts, where \(\sigma_{f} \to 0\), the \({\text{crps}}\) in Eq. (40) reduces to the absolute error: \({\text{AE}} = \left| {T_{\text{obs}} - \hat{T}} \right|\). If we assume that \(P_{f}\) is the cumulative distribution function (CDF) of a normal distribution with mean \(\mu_{f}\) and standard deviation \(\sigma_{f}\), a closed form for \({\text{crps}}\) can be derived by repeatedly integrating by parts in Eq. (40) (Gneiting et al. 2005):

$${\text{crps}}\left( {t + k} \right) = \sigma_{f} \left\{ {\frac{{T_{\text{obs}} - \mu_{f} }}{{\sigma_{f} }}\left[ {2\varPhi \left( {\frac{{T_{\text{obs}} - \mu_{f} }}{{\sigma_{f} }}} \right) - 1} \right] + 2\varphi \left( {\frac{{T_{\text{obs}} - \mu_{f} }}{{\sigma_{f} }}} \right) - \frac{1}{\sqrt \pi }} \right\},$$
(42)

where \(\varphi \left( \cdot \right)\) and \({{\varPhi }}\left( \cdot \right)\) denote the PDF and the CDF, respectively, of the normal distribution with mean 0 and variance 1 evaluated at the normalized prediction error, \(\varepsilon_{n} = \left( {T_{\text{obs}} - \mu_{f} } \right)/\sigma_{f}\). This expression is very useful for obtaining the \({\text{CRPS}}\) of large or many verification series and for calibrating ensemble forecasts from its optimization. In this paper, we will use it for deriving a general result that relates the \({\text{CRPS}}\) with the \({\text{RMSE}}\) of the ensemble mean of Gaussian probability forecasts.

Let us assume that the ensemble-mean forecast error, \(\varepsilon = T_{\text{obs}} - \mu_{f}\), follows a Gaussian distribution with zero mean and standard deviation \(\sigma_{\varepsilon }\). Notice that \(\sigma_{f} \ne \sigma_{\varepsilon }\); the former is given by the intra-ensemble spread, \(\sigma_{f} = \sigma_{\text{ensemble}}\), and the latter can be estimated from the \({\text{RMSE}}\) between ensemble mean and observation. The \({\text{CRPS}}\) and the \({\text{RMSE}}\) can be related by averaging Eq. (42) for all possible values of the error, \(\varepsilon\):

$$\left\langle {{\text{crps}}\left( {t + k} \right)} \right\rangle_{\varepsilon } = \int\limits_{ - \infty }^{\infty } {\varphi \left( {\frac{\varepsilon }{{\sigma_{\varepsilon } }}} \right){\text{crps}}\left( {t + k} \right)d\left( {\frac{\varepsilon }{{\sigma_{\varepsilon } }}} \right)} ,$$
(43)

where \(\varphi \left( \cdot \right)\) is defined as in Eq. (42). If we now replace Eq. (42) in Eq. (43) and integrate by parts, we obtain:

$$\left\langle {{\text{crps}}\left( {t + k} \right)} \right\rangle_{\varepsilon } = \frac{{\sigma_{\varepsilon } }}{\sqrt \pi }\left[ {\sqrt {2\left( {1 + {{\sigma_{f}^{2} } \mathord{\left/ {\vphantom {{\sigma_{f}^{2} } {\sigma_{\varepsilon }^{2} }}} \right. \kern-0pt} {\sigma_{\varepsilon }^{2} }}} \right)} - {{\sigma_{f} } \mathord{\left/ {\vphantom {{\sigma_{f} } {\sigma_{\varepsilon } }}} \right. \kern-0pt} {\sigma_{\varepsilon } }}} \right].$$
(44)

The average for all possible values of the error, \(\left\langle \cdot \right\rangle_{\varepsilon }\), can be approximated by the time average, Eq. (41), for long enough verification periods. Moreover, we can approximate \(\sigma_{f}\) and \(\sigma_{\varepsilon }\) by their corresponding time-average estimates: \(\sigma_{f}^{2} \approx \overline{{\sigma_{\text{ensemble}}^{2} }}\) and \(\sigma_{\varepsilon } = {\text{RMSE}}\). Using the definition of \({\text{ESS}} = \overline{{\sigma_{\text{ensemble}}^{2} }} /{\text{MSE}}\) (Eq. (41)), we can finally rewrite Eq. (44) as:

$${\text{CRPS}}\left( k \right) = \frac{{{\text{RMSE}}\left( k \right)}}{\sqrt \pi }\lambda \left( {\text{ESS}} \right),$$
(45)

where \(\lambda \left( {\text{ESS}} \right) = \sqrt {2\left( {1 + {\text{ESS}}} \right)} - \sqrt {\text{ESS}}\). The function \(\lambda \left( {\text{ESS}} \right)\) takes the minimum value \(\lambda_{ \text{min} } = 1\) for a system with perfect reliability where \({\text{ESS}} = 1\). For any other value of \({\text{ESS}}\), \({\text{CRPS}} > {\text{RMSE}}/\sqrt \pi\). This result shows that, for ensemble prediction systems, the optimal way of producing parametric probabilistic forecasts, assuming a Gaussian distribution, is by calculating the standard deviation of the forecast distribution from the hindcast period rather than just from the current forecast ensemble. This result agrees with previous studies (Kharin and Zwiers 2003; Kharin et al. 2009, 2017), which reach the same conclusion from the optimization of other standard probabilistic skill measures (e.g., the Brier skill score).

As we mentioned before, StocSIPS is a system with nearly perfect reliability and it assumes, by hypothesis, the Gaussianity of the errors. In that sense, the analytical expression for \({\text{RMSE}}_{{H,\sigma_{\text{T}} }}^{m} \left( k \right)\) (Eq. (32)) can be used to obtain a theoretical expression for \({\text{CRPS}}\left( k \right)\) in Eq. (45). At the same time, the verification of this expression through a comparison between the values of \({\text{RMSE}}\left( k \right)\) and \({\text{CRPS}}\left( k \right)\) obtained from hindcasts can be used to check the validity of the model.

In Fig. 15 we show the time mean \({\text{CRPS}}\) as a function of \(k\), calculated in the verification period 1931–2017 for the probabilistic forecast of the monthly temperature anomalies of the Mean-G dataset. In the figure we show the results of the forecast of the raw anomalies (red circles), for which both the natural variability and the anthropogenic trend have to be forecast. Similarly to the previous results for the \({\text{RMSE}}\), the difference with the score of the forecast of the detrended anomalies is negligible (\({\text{CRPS}}_{\text{raw}} \approx {\text{CRPS}}_{\text{nat}}\)), corresponding to the very small error on the projection of the trend compared to the error on the prediction of the detrended anomalies. The line in blue with empty squares, almost coincident with the red line, shows the function \({\text{RMSE}}_{\text{raw}} \left( k \right)/\sqrt \pi\), in perfect agreement with the theoretical prediction for the optimal value \(\lambda_{ \text{min} } = 1\) in Eq. (45), corresponding to perfect reliability. In the green triangles we included the \({\text{CRPS}}\) of the reference climatology forecast of the natural variability component (\({\text{CRPS}}_{\text{nat}}^{\text{clim}} = 0.083\) °C). That is, using the fixed climatological probability distribution (shown in grey in Fig. 13), with zero mean and standard deviation \(\sigma_{\text{clim}} = 0.147\) °C, to forecast the detrended anomalies. If we use the same climatological distribution for forecasting the raw anomalies, we obtain the much larger value \({\text{CRPS}}_{\text{raw}}^{\text{clim}} = 0.181\) °C.

Fig. 15
figure 15

\({\text{CRPS}}\) as a function of the forecast horizon, \(k\), calculated in the verification period 1931–2017 for the probabilistic forecast of the monthly temperature anomalies of the Mean-G dataset. In red circles we show the \({\text{CRPS}}\) for the forecast of the raw anomalies, for which both the natural variability and the anthropogenic trend have to be forecast. The line in blue with squares, almost coincident with the red line, shows the function \({\text{RMSE}}_{\text{raw}} \left( k \right)/\sqrt \pi\), in perfect agreement with the theoretical prediction for the optimal value \(\lambda_{ \text{min} } = 1\) in Eq. (45). In green triangles we included the \({\text{CRPS}}\) of the reference climatology forecast of the detrended anomalies, \({\text{CRPS}}_{\text{nat}}^{\text{clim}} = 0.083\) °C

3.5 Comparison with GCMs

According to the World Meteorological Organization (WMO) (http://www.wmo.int/pages/prog/wcp/wcasp/gpc/gpc.php), there are currently fifteen major centers providing global seasonal forecasts. Thirteen of them have been officially designated by the WMO as Global Producing Centres for Long-Range Forecasts (GPCLRFs). The Meteorological Service of Canada (MSC) contributes with the Canadian Seasonal to Interannual Prediction System (CanSIPS) (Merryfield et al. 2011, 2013).

CanSIPS is a multi-model ensemble (MME) system using 10 members from each of two climate models (CanCM3 and CanCM4) developed by the Canadian Centre for Climate Modelling and Analysis (CCCma) for a total ensemble size of 20 realizations. It is a fully coupled atmosphere–ocean–ice–land prediction system relying on operational data assimilation for the initial state of the atmosphere, sea surface temperature and sea ice.

To evaluate forecasts and compare StocSIPS with CanSIPS, we accessed the publicly available series of hindcasts of CanSIPS covering the period 1981–2010 (CanSIPS 2016). The fields, available on 145 \(\times\) 73 latitude–longitude grids at resolutions of 2.5° \(\times\) 2.5° for each of the 20 ensemble members, were area-weight averaged to obtain global mean series of hindcasts at monthly resolution. CanSIPS produces forecast at the beginning of every month for the average value of that month and the next 11 months; i.e. for lead times from 0 to 11 months for each initialization date. In our case, that would correspond to forecast horizons (number of periods ahead that are forecasted) from 1 to 12 months. In the verification for \(k = 1\) month (lead zero), the hindcast period is January 1981–December 2010; for \(k = 2\) months (lead one), the hindcast period is February 1981–January 2011, and so on. This way, all the 12 series of hindcasts (one for each horizon) have a length of 360 months.

An optimal use of the dynamical model can be obtained after advanced postprocessing and calibration to reduce the bias of the model (Crochemore et al. 2016; Kharin et al. 2017; Van Schaeybroeck and Vannitsem 2018; Pasternack et al. 2018). We do not pretend here to make an exhaustive use of these calibration techniques. To keep the comparison simple, we followed the postprocessing for CanSIPS described in Sects. 3.a and 3.b of (Kharin et al. 2017) for deterministic and parametric probability forecasts, respectively. The statistical adjustment used by the authors is based on a linear rescaling of the ensemble mean and standard deviation of the anomaly forecast. The regression coefficients are obtained by minimizing the \({\text{MSE}}\) and \({\text{CRPS}}\) of the ensemble forecast in some verification period.

It can be easily shown that, after the recalibration, their method will lead to the optimal expression for \({\text{CRPS}}\) given by Eq. (45) when \({\text{ESS}} = 1\): \({\text{CRPS}} = {\text{RMSE}}/\sqrt \pi\). The recalibration method can be reduced to using—as optimal deterministic predictor—the projection of the ensemble mean that minimizes the \({\text{MSE}}\) in some verification period. Then, for the probability distribution forecast, the standard deviation is made equal to the \({\text{RMSE}}\) of the adjusted deterministic forecast instead of calculating it from the intra-ensemble spread. In that sense, the ensemble members are only useful for obtaining the ensemble mean. They do not contribute further to the forecast as the optimal probabilistic scores are obtained from the condition \({\text{ESS}} = 1\).

In their paper, (Kharin et al. 2017) also show that the optimal average skill scores are obtained when time-invariant (independent of the season) coefficients are used. We will use this result here and, instead of using only 30 years for estimating individual coefficients for each month, we use the monthly series to estimate constant coefficients based on 360 months that only depend on the lead time. These coefficients are more stable and do not significantly degrade the accuracy of the forecast due to sampling errors as would season-dependent coefficients.

In Fig. 16, we show an example of a forecast for the 12 months following April 1982 for both StocSIPS and CanSIPS. In red we show the verification curve of observations for the Mean-G dataset. In blue, the median hindcasts for StocSIPS, with the corresponding 95% confidence interval based on the \({\text{RMSE}}\) for the verification period. The ensemble mean for CanSIPS is shown in black, with each of the 20 members shown in dashed light colors and the 95% confidence interval based on the \({\text{RMSE}}\) of the hindcasts represented in grey. The CO2eq trend for the Mean-G dataset (green line) was added as a reference of the long-term equilibrium of the temperature fluctuations.

Fig. 16
figure 16

One example of forecast for the 12 months following April 1982 for both StocSIPS and CanSIPS. In red we show the verification curve of observations for the Mean-G dataset. In blue, the median hindcasts for StocSIPS, with the corresponding 95% confidence interval based on the \({\text{RMSE}}\) for the verification period. The ensemble mean for CanSIPS is shown in black, with each of the 20 members shown in dashed light colors and the 95% confidence interval based on the \({\text{RMSE}}\) of the hindcasts represented in grey. The CO2eq trend for the Mean-G dataset (green line) was added as a reference of the long-term equilibrium of the temperature fluctuations

As expected, the dispersion of the different ensemble members for the dynamical model increases as the horizon increases, which shows the stochastic-like character of GCMs for long-term predictions with the consequent loss in skill. Despite this increase in the spread of the ensemble, the dynamical model is underdispersive for all horizons. The \({\text{ESS}}\) (see Eq. (39) in Sect. 3.4.4) is in the range 0.57–0.74 for all lead times, except for zero months lead time where \({\text{ESS}} =\) 0.40. (Kharin et al. 2017) show that inflating the ensemble spread to satisfy the condition \({\text{ESS}} = 1\), results in more conservative estimates for the forecast probabilities of the three categories and improved reliability of the probability forecast and overall probabilistic skill scores.

3.5.1 Deterministic forecast comparison and seasonality

In this section we present scores for the deterministic forecast (ensemble mean forecast) for both models using for verification the Mean-G dataset in the period 1981–2010. In all cases we used the calibrated ensemble mean for CanSIPS, unless stated otherwise. In Fig. 17, we show density plots of the \({\text{RMSE}}\) as a function of the forecast horizon and the initialization month for StocSIPS and CanSIPS [panels (a) and (b), respectively]. For both models, there is a seasonality pattern with large errors during the Boreal winter months. In the case of StocSIPS, the largest values of the \({\text{RMSE}}\) are found for February, January and March, in that order, while CanSIPS has the largest errors for the forecasts of November and February. In Fig. 17c, we show the difference between CanSIPS \({\text{RMSE}}\) and StocSIPS \({\text{RMSE}}\); positive values indicate that StocSIPS has better skill. StocSIPS outperforms CanSIPS for most of the horizons and initialization months, except for the forecasts of January and February and some other initialization dates for \(k = 1\) month. The overall values of \({\text{RMSE}}\) vs. \(k\)—averaging for all the months in the verification period independently of the initialization date—are shown in Fig. 17d. The curve for StocSIPS is represented in red line with solid squares. For CanSIPS, we show in solid blue line with empty squares the \({\text{RMSE}}\) for the calibrated ensemble mean and in dashed blue line with solid circles the values for the unadjusted model. We can see that the improvement in the \({\text{RMSE}}\) due to the recalibration is very small. We included, for comparison, the curves obtained from hindcasts using persistence (black-triangles). That is, for horizon \(k\), assuming that the temperature \(k\) months into the future is predicted by the present value. The standard deviations for the detrended and for the raw series in the verification period were also included for reference (\(SD_{T}\) and \(SD_{\text{raw}}\), respectively).

Fig. 17

Density plots of the \({\text{RMSE}}\) as a function of the forecast horizon, \(k\), and the initialization month for StocSIPS and CanSIPS (a, b, respectively). For both models, there is a seasonality pattern with large errors during the Boreal winter months. In c, we show the difference between the CanSIPS and the StocSIPS \({\text{RMSE}}\); positive values indicate that StocSIPS has better skill. StocSIPS outperforms CanSIPS for most of the horizons and initialization months, except for the forecasts of January and February and some other initialization dates for \(k = 1\) month. The overall values of \({\text{RMSE}}\) vs. \(k\)—averaging over all the months in the verification period independently of the initialization date—are shown in d. The curve for StocSIPS is shown as a red line with solid squares. For CanSIPS, the scores for the calibrated ensemble mean are shown as a solid blue line with empty squares and the \({\text{RMSE}}\) for the unadjusted model as a dashed blue line with solid circles. The improvement in the \({\text{RMSE}}\) due to the recalibration is very small. For comparison, we included the curve obtained from hindcasts using persistence (black triangles). The standard deviations of the detrended and of the raw series in the verification period were also included for reference (\(SD_{T}\) and \(SD_{\text{raw}}\), respectively)

Similar results are reported in Fig. 18 for the \({\text{MSSS}}\) and the \({\text{ACC}}\). From the density plots [panels (a) and (c)] we reach the same conclusion based on these scores: StocSIPS is better than CanSIPS for most of the horizons and initialization months, except for the forecasts of January and February. In panels (b) and (d), we show the all-months average scores without considering the initialization dates. The results for StocSIPS are shown as a red line with solid squares and those for CanSIPS as a blue line with circles. In the \({\text{MSSS}}\) graphs, we only show the results for the calibrated model. For the \({\text{ACC}}\), as the calibration of CanSIPS is just a rescaling of the ensemble mean, the correlations with or without the calibration are the same. The curves obtained from hindcasts using persistence were also included for comparison (black triangles).

Fig. 18

Density plots for the \({\text{MSSS}}\) and for the \({\text{ACC}}\) (a, c, respectively) as a function of the forecast horizon and the initialization date. Positive values indicate that StocSIPS is better than CanSIPS for most of the horizons and initialization months, except for the forecasts of January and February. In b, d, we show the all-months average scores without considering the initialization dates. The results for StocSIPS are shown as a red line with solid squares and those for CanSIPS as a blue line with circles. In the \({\text{MSSS}}\) graphs, we only show the results for the calibrated model. The horizontal line (green line with empty squares) represents the skill obtained by projecting the CO2eq trend, with climatology as the reference forecast. For the \({\text{ACC}}\), as the calibration of CanSIPS is just a rescaling of the ensemble mean, the correlations with or without the calibration are the same. The curves obtained from hindcasts using persistence were also included for comparison (black triangles). The autocorrelation function of the detrended series (natural variability component), which is the same as the \({\text{ACC}}\) for the forecast of that series using persistence, was included for comparison as a dashed black curve (\({\text{ACC}}_{\text{persistence}}^{\text{nat}}\) in the figure)

For the \({\text{MSSS}}\), we chose climatology as the reference forecast, with \({\text{MSE}}_{\text{ref}} = SD_{\text{raw}}^{2}\) being the variance of the raw series; accordingly, we use the notation \({\text{MSSS}} = {\text{MSSS}}_{\text{raw}}\). The horizontal line (green empty squares) included in the graph represents the skill obtained by projecting the CO2eq trend, with climatology as the reference. This skill is easily computed as \({\text{MSSS}}_{\text{raw}}^{{{\text{CO}}2{\text{eq trend}}}} = 1 - SD_{T}^{2} /SD_{\text{raw}}^{2}\) (\(\approx\) 0.59 for the Mean-G dataset) because the forecast errors are just the detrended anomalies. The values obtained using this equation do not vary significantly for different horizons in the period analyzed. The extra contribution to the skill of StocSIPS comes from the forecast of the natural variability component.

The \({\text{ACC}}\), in the case of persistence, is the same as the autocorrelation function of the reference series at lag \(k\). As mentioned before, the values obtained for the \({\text{ACC}}\) (even for the poor persistence forecasts) are spuriously high due to the anthropogenic trends superimposed on the series. Many authors report similarly high values without taking this fact into consideration. More realistic values would be obtained for the forecast of the detrended series, but there is no impartial way of removing the anthropogenic component for CanSIPS. The anthropogenic forcing is an intrinsic part of the GCM and, to obtain a prediction of the natural variability only, we would have to remove its contribution before running the dynamical model. The autocorrelation function of the detrended series (natural variability component), which is the same as the \({\text{ACC}}\) for the forecast of that series using persistence, was included for comparison as a dashed black curve (\({\text{ACC}}_{\text{persistence}}^{\text{nat}}\) in the figure).

With respect to the comparison of the two models for the deterministic forecast, the conclusion is clear: StocSIPS has better skill than CanSIPS on average for all the measures used and for all horizons except \(k = 1\) month, where CanSIPS is slightly better. This was expected since, for GCMs, 1 month is still close to the deterministic predictability limit imposed by the chaotic behavior of the system (~ 10 days for the atmosphere and 1–2 years for the ocean). After 1 month, the relative advantage of StocSIPS increases as the horizon increases. The reduced skill of StocSIPS for January and February is related to the intrinsic seasonality of the globally-averaged temperature. In future work, this seasonality in the variability could be removed by pre-processing, presumably resulting in further error reduction.

3.5.2 Probabilistic forecast comparison

In the previous section we showed how the two systems (CanSIPS and StocSIPS) compare for deterministic forecasts, whose scores depend only on the ensemble mean. In Fig. 17d, we showed that the reduction in the \({\text{RMSE}}\) of CanSIPS due to the recalibration is very small. In this section we show that this improvement is more noticeable when probabilistic scoring rules are used, as they are influenced not only by the ensemble mean, but also by the ensemble spread, which is readjusted to minimize the \({\text{CRPS}}\) using the condition \({\text{ESS}} = 1\) mentioned before.

Examples of probabilistic forecasts for July 1994 are shown in Fig. 19 for StocSIPS (left) and for CanSIPS (right) for horizon \(k = 2\) months (one month lead time; i.e. using data up to May 1994). The normal PDF in grey represents the climatological distribution of the monthly temperatures for the Mean-G dataset for the verification period 1981–2010. The terciles of the climatological distribution are indicated by vertical dashed lines. These vertical lines define three equiprobable categories of above normal, near normal, and below normal monthly temperatures observed in the verification period. On the left, the forecast distribution for StocSIPS is indicated by the black curve with forecast mean \(\mu_{f} = \hat{T}\left( {{\text{July }}1994} \right) = -\, 0.105\) °C and standard deviation \(\sigma_{f} = {\text{RMSE}}_{\text{StocSIPS}} = 0.109\) °C for \(k = 2\) months. On the right, the distribution in dashed black represents the unadjusted forecast of CanSIPS for \(k = 2\) months, with parameters \(\mu_{f} = - \,0.051\) °C (ensemble mean) and \(\sigma_{f} = \sigma_{\text{ensemble}} = 0.084\) °C (intra-ensemble standard deviation). The calibrated forecast PDF for CanSIPS is shown in solid black in the right panel. The adjusted mean for this distribution is \(\mu_{f} = -\, 0.062\) °C and the inflated standard deviation is \(\sigma_{f} = {\text{RMSE}}_{\text{CanSIPS}}^{\text{Calibrated}} = 0.112\) °C. The areas under the forecast PDFs in different colors indicate the probabilities of below normal (blue), near normal (yellow), and above normal (pink) temperatures. These probabilities are summarized in the top-left corner as bar plots. The climatological probability of 33% is indicated by the horizontal dashed line. The observed temperature for that specific date, \(T_{\text{obs}} = - 0.127\) °C, is represented by the vertical green line. For the unadjusted distribution of CanSIPS, the standard deviation for each specific month and lead time is estimated from the intra-ensemble spread and, as the model is underdispersive, it is generally lower than the standard deviation of the calibrated forecast distribution, which is estimated from the whole verification period and is constant for all months at a particular lead time.
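To make the construction of the category probabilities explicit, here is a minimal sketch that integrates a Gaussian forecast PDF between the climatological terciles. The StocSIPS forecast parameters for July 1994 are those quoted above; the climatological mean and standard deviation used below are placeholders, not values from the paper.

```python
import numpy as np
from scipy.stats import norm

def tercile_probs(mu_f, sigma_f, mu_clim, sigma_clim):
    """Probabilities of below/near/above normal for a Gaussian forecast
    N(mu_f, sigma_f^2), with categories defined by the terciles of the
    climatological distribution N(mu_clim, sigma_clim^2)."""
    t1, t2 = norm.ppf([1/3, 2/3], loc=mu_clim, scale=sigma_clim)  # climatological terciles
    p_below = norm.cdf(t1, loc=mu_f, scale=sigma_f)
    p_above = 1.0 - norm.cdf(t2, loc=mu_f, scale=sigma_f)
    return p_below, 1.0 - p_below - p_above, p_above

# StocSIPS forecast for July 1994 at k = 2 months; climatological parameters are illustrative.
print(tercile_probs(mu_f=-0.105, sigma_f=0.109, mu_clim=0.0, sigma_clim=0.15))
```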

Fig. 19

Examples of probabilistic forecasts for July 1994 are shown for StocSIPS (left) and for CanSIPS (right) for horizon \(k = 2\) months (1 month lead time; i.e. using data up to May 1994). The normal probability density function in grey represents the climatological distribution of the monthly temperatures for the Mean-G dataset for the verification period 1981–2010. The terciles of the climatological distribution are indicated by vertical dashed lines. The colored areas under the forecast density function are proportional to the forecast probabilities of each category: below normal (blue), near normal (yellow) and above normal (pink). These probabilities are summarized in the top-left corner as bar plots. The climatological probability of 33% is indicated by the horizontal dashed line. The observed temperature for that specific date, \(T_{\text{obs}} = - 0.127\) °C, is represented by the vertical green line. On the right, the distribution in dashed black represents the unadjusted forecast of CanSIPS for \(k = 2\) months and the calibrated forecast PDF is shown in solid black. The parameters of all the distributions are included in the legends

The combined contingency table for the forecasts of StocSIPS (grey rows) and CanSIPS (white rows, with the values for the unadjusted forecast in parentheses) for \(k = 1\) month is shown in Table 2. As observational reference we used the Mean-G dataset for verification in the period January 1981–December 2010 (360 months). The number of hits and the total number of events are shown in bold on the main diagonal.

Table 2 Contingency table for 3 category probabilistic forecasts (below normal, near normal and above normal) for the raw (undetrended) Mean-G dataset with zero months lead time (\( k = 1 \) month)

The reduced number of observed events in the near-normal category is a consequence of the deviation from Gaussianity of the undetrended anomalies in the verification period 1981–2010. Specifically, there is a reduced kurtosis caused by the presence of the anthropogenic trend, as can be clearly seen in Fig. 5. The distribution of the detrended anomalies, \(T_{\text{nat}}\), is much closer to a Gaussian (see Appendix 2). In Table 3, we show the contingency table for the forecast of this series using StocSIPS. Now the observations are almost equally distributed among the three categories obtained from the climatological distribution based on the detrended series.

Table 3 Contingency table for StocSIPS 3 category probabilistic forecasts (below normal, near normal and above normal) for the detrended series (\(T_{\text{nat}}\), red curves in Fig. 10) of the Mean-G dataset with zero months lead time (\(k = 1\) month)

From the diagonal elements in Table 2 we obtain the following \({\text{PC}}\) scores for \(k = 1\) month: \({\text{PC}}_{\text{StocSIPS}} \approx 78\%\) for StocSIPS, and \({\text{PC}}_{\text{CanSIPS}}^{\text{Calibrated}} \approx 76\%\) and \({\text{PC}}_{\text{CanSIPS}}^{\text{Unadjusted}} \approx 74\%\) for the calibrated and the unadjusted CanSIPS forecasts, respectively. These values are spuriously high due to the presence of the trend in the raw series. Just from direct inspection of the reference series (red curve in Fig. 5), by projecting the trend we could predict that most of the temperature values in the decade 2001–2010 would fall in the above normal category, while most of the events in the period 1981–2000 would fall in the below normal category. The \({\text{PC}}\) score obtained from Table 3 for the forecast of the natural variability component with \(k = 1\) month using StocSIPS is more realistic: \({\text{PC}}_{\text{StocSIPS}}^{\text{Nat}} \approx 57\%\). As we mentioned before, we cannot perform a similar forecast using CanSIPS since the anthropogenic forcing is an intrinsic part of the GCM. To obtain a prediction of the natural variability only, we would have to remove its contribution before running the dynamical model.
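As a minimal sketch of how such a table and the \({\text{PC}}\) score can be assembled, assuming the forecast and observed categories (0, 1, 2 for below, near and above normal) have already been assigned from the climatological terciles:

```python
import numpy as np

def contingency_and_pc(forecast_cat, observed_cat, n_cat=3):
    """3x3 contingency table (rows: forecast category, columns: observed
    category) and the proportion correct, PC = hits / total."""
    table = np.zeros((n_cat, n_cat), dtype=int)
    for f, o in zip(forecast_cat, observed_cat):
        table[f, o] += 1
    pc = np.trace(table) / table.sum()
    return table, pc

# Categories can be obtained from the climatological terciles t1, t2 with, e.g.:
# cat = np.digitize(values, [t1, t2])   # 0: below, 1: near, 2: above normal
```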

The \({\text{PC}}\) scores for all horizons from \(k =\) 1 to 12 months are shown in Fig. 20. The blue squares show the \({\text{PC}}\) scores for StocSIPS, and the red circles and green triangles those for the calibrated and unadjusted CanSIPS forecasts, respectively. The solid black line shows the skill of StocSIPS for the forecast of the detrended series. The values obtained in this case are lower than those obtained for the raw anomalies, but they are a better measure of the actual quality of the forecasting system since the spurious effect of the trend is removed. The dashed line at 33.3% is a reference showing the skill of the climatological forecast.

Fig. 20

\({\text{PC}}\) as a function of \(k\) for StocSIPS (blue squares) and for CanSIPS, calibrated and unadjusted forecasts in red circles and green triangles, respectively. The solid black line shows the skill of StocSIPS for the forecast of the detrended series. The dashed line at 33.3% is a reference showing the skill of the climatological forecast

Three main conclusions can be drawn from the analysis of Fig. 20. First, there is an improvement in the probabilistic forecast skill of CanSIPS thanks to the recalibration. This improvement is small, but more noticeable than the one obtained for the deterministic scores (e.g. \({\text{RMSE}}\), \({\text{MSSS}}\)). Second, StocSIPS performs better than CanSIPS for all lead times and the relative advantage increases with the forecast horizon up to \(k = 7\) months. Finally, from the comparison of the blue and the solid black curves for the StocSIPS forecasts of the raw and the detrended series, respectively, we see that most of the skill comes from the projection of the trend; for \(k > 8\) months this is the only source of skill.

Although the \({\text{PC}}\) score for StocSIPS is larger for all horizons, it is difficult to evaluate the relative advantage over the probabilistic CanSIPS forecasts based on that score alone. The \({\text{PC}}\) is influenced by the climatological distribution used for defining the categories and, above all, by the presence of the trend. A more realistic comparison should be based on absolute scores that only depend on the forecast system and are independent of the baseline or the climatology chosen. The dependence of the \({\text{CRPS}}\) on the forecast horizon is shown in Fig. 21 for both models in the verification period 1981–2010 for the Mean-G dataset. In red, we show the \({\text{CRPS}}\) for StocSIPS and in blue that for CanSIPS, with a dotted line and solid circles for the unadjusted forecast and a solid line with open squares for the calibrated forecast. The function \({\text{RMSE}}_{\text{CanSIPS}}^{\text{Calibrated}} \left( k \right)/\sqrt \pi\) is shown as a dashed black line with triangles. There is perfect agreement between these optimal values and the \({\text{CRPS}}\) of CanSIPS after the calibration, in correspondence with Eq. (45). The score for the climatological forecast was included in the legend for reference (\({\text{CRPS}}_{\text{Climate}} = 0.117\) °C).

Fig. 21

\({\text{CRPS}}\) vs. \(k\) for both models in the verification period 1981–2010 for the Mean-G dataset. In red, we show the \({\text{CRPS}}\) for StocSIPS and in blue that for CanSIPS, with a dotted line and solid circles for the unadjusted forecast and a solid line with open squares for the calibrated forecast. The function \({\text{RMSE}}_{\text{CanSIPS}}^{\text{Calibrated}} \left( k \right)/\sqrt \pi\) is shown as a dashed black line with triangles. There is perfect agreement between these optimal values and the \({\text{CRPS}}\) of CanSIPS after the calibration, in correspondence with Eq. (45). The score for the climatological forecast was included in the legend (\({\text{CRPS}}_{\text{Climate}} = 0.117\) °C)

If we compare Fig. 21 with Fig. 17d, we can see that the effect of the calibration of the CanSIPS output is more noticeable for the \({\text{CRPS}}\) than for the \({\text{RMSE}}\). The probabilistic forecast gains from both the inflation of the standard deviation and the rescaling of the ensemble mean, while only the latter influences the deterministic forecast. After the adjustment, the CanSIPS forecast is better for zero months lead time, but for the remaining forecast horizons StocSIPS shows more skill. The relative advantage of the stochastic model over the GCM increases the further we forecast into the future. For the first month, the numerical model forecast still falls within the deterministic predictability limit.

4 Discussion

Over the last decades, conventional numerical approaches have developed to the point where they are now skillful at lead times that approach their theoretical (deterministic) predictability limits—themselves close to the lifetimes of planetary structures (about 10 days). This threshold is due to the nonlinearity and complexity of the equations of atmospheric dynamics and their sensitive dependence on initial conditions (the butterfly effect) (Lorenz 1963, 1972), and it cannot be overcome using purely deterministic models, nor by combined deterministic-stochastic approaches such as recent stochastic parameterization models (Berner et al. 2017). In the macroweather regime (from 10 days to decades), GCMs become stochastic: the model integrations are extended far beyond their predictability limits, producing effectively random outputs that are then averaged to obtain the forecast as the model ensemble mean.

The convergence of the dynamical models to their own climates follows from the macroweather property that internal fluctuations decrease with time scale (see Fig. 6 for the case of natural variability—including volcanic and solar forcings). This scaling behavior with a negative fluctuation exponent is present both in real data and in GCM control runs, so the statistics of the variability of conventional numerical models are of the same type as those found in real-world temperature series. The main problem is that each GCM converges to its own model climate, which is different from the actual climate. In addition, the models cannot fully reproduce the actual high frequency weather noise, even if the statistics of the noise they generate are similar to those of the real world.

In that sense, the SLIMM model, developed in Lovejoy et al. (2015), uses real data to generate the high-frequency noise with the correct statistical symmetries for the fluctuations and with a realistic climate. The main characteristics of SLIMM were summarized in Sect. 2.1. In this paper we presented the Stochastic Seasonal to Interannual Prediction System (StocSIPS), which includes SLIMM as the core model to forecast the natural variability component of the temperature field. StocSIPS also represents a more general framework for modelling the seasonality and the anthropogenic trend and the possible inclusion of other atmospheric fields at different temporal and spatial resolutions. In this sense, StocSIPS is the general system and SLIMM is the main part of it dedicated to the modelling of the stationary scaling series.

StocSIPS is based on statistical properties of the macroweather regime such as the Gaussianity of temperature fluctuations (justified in Appendix 2) and the temporal scaling symmetry of the natural variability with negative fluctuation exponents, as shown in Sect. 3.2. It also assumes independence between the high frequency natural variability of the temperature field and the low frequency component dominated by anthropogenic effects. The anthropogenic component is represented as a short memory linear response to equivalent CO2 forcing. The natural variability component is modeled and predicted using the stochastic approach originally proposed in SLIMM.

The scaling of the fluctuations implies that there are power-law decorrelations in the system and hence a large memory effect that can be exploited. The simplest stochastic model that includes both the Gaussianity and the scaling of the fluctuations is the fGn process. The Gaussian statistics of the temperature natural variability fluctuations allowed us to use the mean square prediction framework to build an optimal conditional expectation predictor based on a linear combination of past data.

In Sects. 2 and 2.1 we discuss how fGn can be obtained in SLIMM as the solution of a fractional order differential equation, which in turn is a generalization of the integer order stochastic differential equations of LIM models. The fractional derivative is introduced to account for the large memory effect given by the power-law behavior of the correlation function; in contrast, integer order derivatives imply short memory autoregressive moving average processes with asymptotically exponential decorrelations. The fractional differential equation can be obtained as the high frequency limit of a fractional energy balance equation in which the usual (exponential) temperature relaxation to equilibrium is replaced by power-law relaxation (work in progress). The main characteristics of SLIMM are summarized in Sect. 2.1, including the formal expression for the predictor as an integral of innovations extending an infinite time into the past. Physically, the source of the long-range memory is energy stored in the ocean (in gyres, eddies and at depth) and over land (in ice, soil moisture, etc.).

The original technique that was used to make the SLIMM forecasts was basically correct, but it made several approximations (such as that the amount of data available for the forecast was infinite) and it was numerically cumbersome. Most of this work was dedicated to improving the mathematical treatment and the numerical techniques of SLIMM and to validating them on ten different global temperature series since 1880 (five globally averaged and five over land).

The main improvement included in StocSIPS for the prediction of temperature series is the application of discrete-in-time fGn to obtain an optimal predictor based on a finite amount of past data. In Sect. 2.2.1 we give the theoretical expressions for the predictor coefficients and the skill as functions of the fluctuation exponent alone. This represents an advantage over other autoregressive models (AR, ARMA) which do not include fractional integrations that account for the long-term memory and hence do not exploit the information from the distant past. An additional limitation of these approaches is that, in order to predict, the autocorrelation function for each time lag, \(C\left( {\Delta t} \right)\), must be estimated directly from the data. Each \(C\left( {\Delta t} \right)\) has its own sampling error; this effectively introduces a large “noise” into the predictor estimates and a large computational cost if many coefficients are needed. In our fGn model the coefficients have an analytic expression which only depends on the fluctuation exponent, \(H\), obtained directly from the data by exploiting the scale-invariance symmetry of the fluctuations; our problem is a statistically highly constrained problem of parametric estimation (\(H\)), not an unconstrained one (the entire \(C\left( {\Delta t} \right)\) function).
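The analytic expressions of Sect. 2.2.1 are not reproduced here, but the same predictor can be obtained numerically from the fGn autocovariance by solving the minimum mean square error system. The sketch below assumes the standard discrete fGn autocovariance with Hurst parameter \(h = H + 1\) (with the paper's fluctuation exponent \(H < 0\)); the function names are illustrative.

```python
import numpy as np

def fgn_acov(lag, h, sigma2=1.0):
    """Autocovariance of discrete fGn at integer lag, Hurst parameter h (0 < h < 1)."""
    j = np.abs(np.asarray(lag, dtype=float))
    return 0.5 * sigma2 * ((j + 1) ** (2 * h) - 2 * j ** (2 * h) + np.abs(j - 1) ** (2 * h))

def fgn_predictor(H, k, m):
    """MMSE weights for predicting k steps ahead from the last m values of a
    unit-variance fGn with fluctuation exponent H (-1/2 < H < 0), h = H + 1.
    Returns the weights for [X_t, X_{t-1}, ..., X_{t-m+1}] and the theoretical
    forecast error variance."""
    h = H + 1.0
    lags = np.arange(m)
    R = fgn_acov(np.abs(lags[:, None] - lags[None, :]), h)  # covariance of the m past values
    r = fgn_acov(k + lags, h)                                # covariance with the target X_{t+k}
    phi = np.linalg.solve(R, r)
    mse = fgn_acov(0, h) - r @ phi
    return phi, mse

# Example: weights and error variance for a 2-step forecast from the last 24 values,
# with a fluctuation exponent of the order of that of global temperature (assumed H = -0.1).
phi, mse = fgn_predictor(H=-0.1, k=2, m=24)
```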

Other technical details of discrete-in-time fGn models are given in Appendix 1. We discuss how to produce exact realizations of fGn processes with a given length, \(N\), and parameters \(\sigma\), \(\mu\) and \(H\). The inverse process of obtaining those parameters from a given time series is also discussed. Other important results presented in Appendix 1 are an algorithm called quasi maximum likelihood estimation (QMLE) for obtaining the parameter \(H\), and the derivation of some ergodic properties of fGn processes. The QMLE method is slightly less accurate—but much more efficient computationally—than the usual maximum likelihood method. It has the advantage of being part of the verification process, as it minimizes the mean square error of the hindcasts. The ergodicity of the variance of the process, expressed in Eq. (62), besides proving the convergence of the temporal average estimate of the variance to the ensemble variance, also shows that this convergence is ultra slow for values of \(H\) close to zero. This fact implies a strong dependence of the resulting skill score on the length of the hindcast series used for verification. It could potentially impact statistical methods that depend on the covariance matrix, e.g. empirical orthogonal functions (EOF) and empirical mode decomposition (EMD).
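The specific simulation algorithm is the one given in Appendix 1 (not reproduced here); a generic way to produce exact realizations of a Gaussian process with the discrete fGn covariance is a Cholesky factorization of the full covariance matrix, sketched below under the same \(h = H + 1\) assumption (the cost is O(N³), so it is only practical for moderate \(N\)).

```python
import numpy as np

def fgn_sample(N, H, sigma=1.0, mu=0.0, seed=None):
    """Exact realization of length N of discrete-in-time fGn with fluctuation
    exponent H (-1/2 < H < 0), standard deviation sigma and mean mu."""
    h = H + 1.0
    j = np.abs(np.subtract.outer(np.arange(N), np.arange(N))).astype(float)
    C = 0.5 * sigma ** 2 * ((j + 1) ** (2 * h) - 2 * j ** (2 * h) + np.abs(j - 1) ** (2 * h))
    L = np.linalg.cholesky(C)                      # exact: the fGn covariance is positive definite
    rng = np.random.default_rng(seed)
    return mu + L @ rng.standard_normal(N)
```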

The main result of this work is the application of StocSIPS to the modeling and forecasting of global temperature series. With that purpose, we selected the five major observation-based global temperature data series which are in common use (see Sect. 3.1).

Over the last century, low frequencies are dominated by anthropogenic effects and after 10–20 years the scaling regime changes from a negative to a positive value of \(H\) (see Fig. 6). The anthropogenic component was modelled as a linear response to equivalent CO2 forcing and removed. The residual natural variability component was then modeled and predicted using the theory presented in Sect. 2 and Appendix 1. The quality of the fit of the fGn model to the real data was evaluated in detail in Appendix 2.
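As an illustration of this step, here is a minimal sketch of removing the anthropogenic component by linear regression of the temperature on an equivalent-CO2 forcing surrogate; the logarithmic forcing form and the preindustrial concentration used below are standard assumptions, not values taken from the paper.

```python
import numpy as np

def remove_anthropogenic(T, co2eq, co2eq_pre=277.0):
    """Split T into an anthropogenic trend and natural variability by regressing
    T on log2(co2eq / co2eq_pre); co2eq_pre (ppm) is an assumed preindustrial value.
    Returns (T_anth, T_nat, sensitivity, offset)."""
    forcing = np.log2(np.asarray(co2eq, dtype=float) / co2eq_pre)
    sensitivity, offset = np.polyfit(forcing, T, 1)   # least-squares linear fit
    T_anth = sensitivity * forcing + offset
    return T_anth, T - T_anth, sensitivity, offset
```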

To validate our model, we produced series of hindcasts for the period 1931–2017 with forecast horizons from 1 to 12 months. These series were stratified to obtain the dependence of the forecast skill on the forecast horizon and on the initialization time. The RMSE of the hindcasts was lower than the standard deviation of the verification series for all horizons, showing positive skill. The values obtained for the all-month average results were in good agreement with the theoretical predictions. Other skill scores, such as the MSSS and the ACC, were also obtained.
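For reference, the deterministic scores used throughout the verification can be computed along the lines of the following sketch (climatology is taken as the reference forecast for the MSSS; the stratification by initialization month is omitted):

```python
import numpy as np

def deterministic_scores(forecast, obs):
    """RMSE, MSSS (with climatology as reference) and ACC for a set of hindcasts."""
    err = forecast - obs
    rmse = np.sqrt(np.mean(err ** 2))
    msss = 1.0 - np.mean(err ** 2) / np.var(obs)     # MSE_ref = variance of the verification series
    acc = np.corrcoef(forecast, obs)[0, 1]           # anomaly correlation coefficient
    return rmse, msss, acc
```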

StocSIPS's source of predictability is the strong long-range correlations present in the temperature time series. In that sense, there is no source of skill coming from interannual variations, since the model assumes that the seasonality, as well as the low frequency trend in the raw data, is deterministic. Theoretically, we should not expect a dependence of the skill on the initialization time. However, the stratification of the data shows that there is a multiplicative seasonality effect that makes the variability different for each individual month (see Fig. 11). The standard deviation of the temperature for the Boreal winter months is considerably larger than for the rest. This affects the skill of StocSIPS for those months and is a discrepancy with respect to the stationarity hypothesis. In future work, we could compensate for this effect through preprocessing of the time series and study the implications for StocSIPS's forecast skill.
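One possible form of such preprocessing (not a scheme adopted in the paper) would be to standardize each calendar month by its own standard deviation before fitting the fGn model, for example:

```python
import numpy as np

def standardize_by_month(anom, months):
    """Divide each calendar month's anomalies by that month's standard deviation
    to remove multiplicative seasonality. Returns the scaled series and the twelve
    scale factors (needed to restore the original units after forecasting).

    anom   : array of monthly temperature anomalies
    months : integer array of calendar-month indices (1..12) for each value
    """
    anom = np.asarray(anom, dtype=float)
    scales = np.array([anom[months == m].std(ddof=1) for m in range(1, 13)])
    return anom / scales[months - 1], scales
```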

In Sect. 3.4.4 we showed how to make parametric probability forecasts using StocSIPS. For a prediction system with Gaussian errors, we derived a theoretical relation between the deterministic score RMSE and the probabilistic CRPS. We also showed that StocSIPS is—by definition—a nearly perfectly reliable system and that this theoretical relation is satisfied by the verification results.
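Equation (45) itself is not reproduced here, but the relation quoted in Sect. 3.5.2 (\({\text{CRPS}} = {\text{RMSE}}/\sqrt{\pi}\) for a reliable Gaussian system) can be checked numerically using the standard closed-form CRPS of a Gaussian forecast; the numbers below are illustrative.

```python
import numpy as np
from scipy.stats import norm

def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS for a Gaussian forecast N(mu, sigma^2) and observation y."""
    z = (y - mu) / sigma
    return sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))

# A perfectly reliable system: observations drawn from the forecast distribution itself.
rng = np.random.default_rng(0)
sigma = 0.11                                   # plays the role of the RMSE
y = rng.normal(0.0, sigma, 100_000)
print(np.mean(crps_gaussian(0.0, sigma, y)), sigma / np.sqrt(np.pi))   # nearly equal
```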

Finally, in Sect. 3.5 we compared StocSIPS with the Canadian Seasonal to Interannual Prediction System (CanSIPS), which is one of the GCMs contributing to the Long-Range Forecast project of the World Meteorological Organization. Deterministic and probabilistic forecast skill scores for StocSIPS and for CanSIPS were compared over the verification period 1981–2010.

The main conclusion is that, for the overall forecast including all the months in the verification period and without considering different initialization times, StocSIPS has higher skill than CanSIPS for all the metrics used and for all horizons except \(k = 1\) month, where CanSIPS is slightly better. This was not surprising since, for GCMs, 1 month is still close to the deterministic predictability threshold imposed by the chaotic behavior of the system (~ 10 days for the atmosphere and 1–2 years for the ocean). Beyond 1 month, the relative advantage of StocSIPS increases as the horizon increases. The seasonal stratification of the verification shows that, due to the interannual variability, CanSIPS performs better than StocSIPS for the forecasts of January and February. For the other months (beyond zero months lead time) StocSIPS has better skill.

5 Conclusions

In this paper we presented the Stochastic Seasonal to Interannual Prediction System (StocSIPS), which is based on statistical properties of the macroweather regime such as the Gaussianity of temperature fluctuations and the temporal scaling symmetry of the natural variability. StocSIPS includes SLIMM as the core model to forecast the natural variability component of the temperature field. Here we improved the theory and numerical methods of SLIMM for direct application to macroweather forecasting.

In summary, StocSIPS models the temperature series as a superposition of a periodic signal corresponding to the annual cycle, a low frequency deterministic trend from anthropogenic forcings and a high frequency stochastic natural variability component. The annual cycle can be estimated directly from the data and is assumed constant in the future, at least for horizons of a few years. The anthropogenic component is represented as a linear response to equivalent CO2 forcing and can be projected very accurately 1 year into the future by using two parameters, the climate sensitivity and an offset, which can be obtained from linear regression given historical emissions. Finally, the natural variability is modeled as a discrete-in-time fGn process which is completely determined by the variance and the fluctuation exponent. That gives a total of only four parameters for modeling and predicting the temperature series. Those parameters are quite stable and can be estimated with good accuracy from past data.

The comparison with CanSIPS validates StocSIPS as a good alternative and a complementary approach to conventional numerical models. The reason is that, whereas CanSIPS and StocSIPS have the same type of statistical variability around the climate state, the CanSIPS model climate is different from the real-world climate. In contrast, StocSIPS uses historical data to force the forecast to the real-world climate. From a forecast point of view, GCMs can in general be seen as an initial value problem for generating many “stochastic” realizations of the state of the atmosphere, while StocSIPS is effectively a past value problem that directly estimates the most probable future state.

The prediction of global average temperature series presented in this paper is based on some symmetries of the macroweather regime: scale invariance and low intermittency (rough Gaussianity). In a future paper (currently in preparation), we show how another macroweather symmetry, the statistical space–time factorization (Lovejoy and de Lima 2015), can be included to extend the application of StocSIPS to temperature forecasts at the regional level, with arbitrary spatial resolution and without the need for downscaling. Another future application derived from this work is the combination of CanSIPS and StocSIPS into a single hybrid forecasting system that, thanks to their qualitatively different approaches, improves on both, especially at zero lead time. We have already obtained some predictions with the combined model, “CanStoc”, and we are currently preparing a publication on these results. We are also working on the application of StocSIPS to the forecast of GCM preindustrial control runs to show that they satisfy the same macroweather symmetries as real-world data and hence that, in addition to their deterministic predictability limits, GCMs are also subject to stochastic predictability limits. These limits correspond to the maximum possible skill that can be achieved by a stochastic Gaussian scaling system with a given scaling exponent (a measure of the memory and the predictability in the data).

In May 2016, we created the website http://www.physics.mcgill.ca/StocSIPS/, where global average and regional temperature forecasts at monthly, seasonal and annual resolutions using StocSIPS are published on a regular basis.