1 Introduction

The Navier–Stokes equations are the core of conventional numerical models for atmospheric prediction. These equations are derived from general conservation laws for energy, momentum and mass. They also possess an implicit scale-invariance symmetry, which is sometimes overlooked relative to the other conservation laws (Lovejoy and Schertzer 2013; Palmer 2019). In this work, we exploit this symmetry as the basis for stochastic modelling and prediction of global temperature anomalies.

From hourly to centennial time scales, atmospheric fields are characterized by three scaling regimes: at high frequencies, the weather, with fluctuations increasing with the time scale; beyond a transition at \({\tau }_{w}\sim 10\) days, the macroweather, with fluctuations decreasing with scale; and at low frequencies, the climate, with fluctuations again increasing. In the recent epoch, anthropogenic warming induces the transition to the climate regime at \({\tau }_{c}\) ~ 15–20 years, whereas pre-industrial records show \({\tau }_{c}>100\) years (the Holocene transition scale is still not well known) (Lovejoy 2014). The transition time, \({\tau }_{w}\), is the lifetime of planetary-scale structures (Lovejoy and Schertzer 1986, 2010) and is therefore close to the deterministic predictability limit of conventional numerical weather prediction models. This predictability threshold is imposed by the high complexity of the system and its sensitive dependence on initial conditions.

To extend predictions to weekly, monthly and seasonal averages, stochasticity is incorporated at different levels in deterministic prediction systems. The ensemble approach, in which many different “random” realizations are obtained by integrating the model equations from slightly different initial conditions, is fundamentally stochastic: sampling the attractor of the dynamical system is assumed to be equivalent to sampling the probability distribution of the possible outputs. Besides this implicit randomness produced by chaos, explicit stochastic parameterization schemes are increasingly being incorporated in prediction systems. Hybrid deterministic-stochastic approaches seem to be the future of macroweather forecasting (Williams 2012; Christensen et al. 2017; Davini et al. 2017; Rackow and Juricke 2020). The importance and current state of stochastic climate modelling have been extensively discussed in the reviews by Franzke et al. (2015) and Palmer (2019).

In addition to these stochastic improvements to the deterministic core of conventional Global Circulation Models (GCMs), purely stochastic models have evolved as a complementary approach since the pioneering work of Hasselmann (1976). In these Hasselmann-type models, the high-frequency “weather” is treated as a driving noise of the low-frequency components, which are described by integer-order linear ordinary differential equations. The best known are the linear inverse models (LIM) (Penland and Matrosova 1994; Penland and Sardeshmukh 1995; Winkler et al. 2001; Newman et al. 2003; Sardeshmukh and Sura 2009), which have been presented as a benchmark for decadal surface temperature forecasts. One of the main limitations of the LIM, however, is that it implicitly assumes short-range, exponential temporal decorrelations, whereas it has been shown that the true decorrelations are closer to long-range power laws (Koscielny-Bunde et al. 1998; Franzke 2012; Rypdal et al. 2013; Yuan et al. 2015). Consequently, LIM models underestimate the memory of the system, limiting the useful forecast horizon to roughly 1 year (Newman 2013).

In Lovejoy et al. (2015), the ScaLIng Macroweather Model (SLIMM) was introduced as an alternative stochastic model that respects the scaling symmetry. SLIMM generalizes LIM to use fractional differential equations that involve strong, long-range memories; it is these long-range memories that are exploited in SLIMM forecasts. The solution to the fractional differential equation in SLIMM is a fractional Gaussian noise process that is used to model the natural temperature variability.

In a recent series of papers (Lovejoy 2019, 2021a, b; Lovejoy et al. 2021), the classical Energy Balance Equation (EBE) is generalized to fractional orders: the Fractional EBE (FEBE). The phenomenological derivation of the FEBE complements derivations based on the classical continuum-mechanics heat equation and on the more general Fractional Heat Equation (FHE) (Lovejoy et al. 2021), a fractional diffusion equation that has been studied in the statistical physics literature. When the FEBE is driven by a Gaussian white noise, the result is fractional Relaxation noise (fRn), which generalizes the classical Ornstein–Uhlenbeck process; just as the high-frequency limit of the Ornstein–Uhlenbeck process is a Brownian motion, the high-frequency limit of fRn is a fractional Gaussian noise (fGn) process (Lovejoy 2019). In that sense, the fractional differential equation and the corresponding fGn solution exploited in SLIMM are the high-frequency approximations of the FEBE and its fRn solution, respectively.

In Del Rio Amador and Lovejoy (2019) (hereafter DRAL) the Stochastic Seasonal to Interannual Prediction System (StocSIPS) was introduced and applied to the prediction of globally averaged monthly temperature in the macroweather regime. StocSIPS includes SLIMM as the core model to forecast the natural variability component of the temperature field, but also represents a more general framework for modelling the seasonality and the anthropogenic trend and the possible inclusion of other atmospheric fields at different temporal and spatial resolutions. In this sense, StocSIPS is the general system and SLIMM is the main part of it dedicated to the modelling of the stationary scaling series. StocSIPS also improves the mathematical and numerical techniques used in the original SLIMM.

In DRAL, we presented the basic theory behind StocSIPS and applied it to the prediction of globally averaged series showing verification skill scores in both deterministic and probabilistic modes. We also compared hindcasts with Canada’s operational long-range forecast system, the Canadian Seasonal to Interannual Prediction System (CanSIPS), and we showed that StocSIPS is just as accurate for 1-month forecasts, but significantly more accurate for longer lead times.

In this paper (specifically in Sects. 2.2 and 2.3), we verify that the scaling symmetry, which is the basis of StocSIPS, also holds at the regional level for monthly surface temperature, although some modifications must be introduced in the pre-processing of the tropical ocean temperature anomalies. In Sect. 2.4, we describe these particularities together with some theoretical details; the most technical aspects are purposely placed in Appendix 1 so that the main body of the article remains focused on results. Although all the equations and details relevant to this paper are given in the main text or in Appendix 1, the interested reader may refer to the more detailed theoretical description given in DRAL. The applicability of the model to all the regional series is confirmed through statistical testing in the second part of Sect. 2.4 and by contrasting the theoretically expected skill scores (if the model were perfect) with actual hindcast verification results for the natural temperature variability in Sect. 3.1. Finally, in Sect. 3.2 we apply StocSIPS to obtain monthly and seasonal predictions of the surface temperature and we show some preliminary comparisons with multi-model ensemble (MME) GCM results.

For 1-month lead time, our simple stochastic model shows skill scores similar to those of the much more complex conventional models, with the advantage that it is far less expensive computationally and can easily be adapted to direct hyperlocal prediction without the need for downscaling. From a forecast point of view, GCMs can be seen as an initial value problem for generating many “stochastic” realizations of the state of the atmosphere, while StocSIPS is effectively a “past value problem” that estimates the most probable future state from long series of past data. The results obtained validate StocSIPS as a good alternative and a complementary approach to conventional numerical models. This complementarity is the basis for combining the two into a hybrid model that would bring the best of both worlds.

2 StocSIPS

2.1 Data preprocessing

In this study, the reference observational dataset is the monthly average surface temperature (T2m) from the National Centers for Environmental Prediction/National Center for Atmospheric Research (NCEP/NCAR) Reanalysis 1 (Kalnay et al. 1996; NCEP/NCAR 2020). The data were accessed on January 3, 2020 and cover the period January 1948 to December 2019 (864 months in total). All data were interpolated to a 2.5° latitude × 2.5° longitude grid across the globe, for a total of 73 × 144 = 10,512 grid points. Our objective is to model and predict this dataset using the Stochastic Seasonal to Interannual Prediction System (StocSIPS).

StocSIPS was presented in DRAL and applied to the prediction of globally averaged temperature in the macroweather regime. The main idea behind it is to consider the temperature series at position \(\mathbf{x}\) as a combination of three independent signals:

$$T\left( {{\mathbf{x}},t} \right) = T_{{{\text{ac}}}} \left( {{\mathbf{x}},t} \right) + T_{{{\text{anth}}}} \left( {{\mathbf{x}},t} \right) + T_{{{\text{nat}}}} \left( {{\mathbf{x}},t} \right).$$
(1)

The first component, \({T}_{\text{ac}}\left(\mathbf{x},t\right)\), is the periodic annual cycle and is obtained from the mean temperature for each month in some reference period (here taken as the full length of the temperature datasets: 1948–2019). We assume that, for the time scales involved in the modelling and prediction problems, the annual cycle is unchanged. Also, for such a long verification period, the differences with the anomalies obtained using leave-one-out cross-validation methods are negligible. In Fig. 1a, we show an example of the raw temperature data, \(T\) (in red), and the periodic signal, \({T}_{\text{ac}}\) (in blue), for the time series corresponding to the coordinates 50.0°N, 2.5°E (near Paris, France). In the graph, only the period 1981–2010 is shown for visual clarity.
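For concreteness, the following minimal sketch (in Python with NumPy; the function and variable names are illustrative and not part of the operational StocSIPS code) computes the per-calendar-month climatology over the full record and subtracts it from the raw series:

```python
import numpy as np

def remove_annual_cycle(T, start_month=1):
    """Split a monthly series T into the periodic annual cycle T_ac and the
    residual anomalies T - T_ac.  The climatology is the mean for each
    calendar month over the full record, as described in the text."""
    T = np.asarray(T, dtype=float)
    months = (np.arange(T.size) + start_month - 1) % 12  # calendar-month index 0..11
    T_ac = np.empty_like(T)
    for m in range(12):
        sel = (months == m)
        T_ac[sel] = T[sel].mean()                        # per-calendar-month mean
    return T_ac, T - T_ac

# Synthetic example: 72 years of monthly values with a seasonal cycle plus noise
t = np.arange(864)
T = 10.0 + 8.0 * np.cos(2 * np.pi * t / 12) + np.random.randn(t.size)
T_ac, T_anom = remove_annual_cycle(T)
```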

Fig. 1
figure 1

Example of signal pre-processing and spectra for the grid point with coordinates 50.0°N, 2.5°E (near Paris, France). a Raw temperature data, \(T\) (in red), and the periodic signal, \({T}_{\text{ac}}\) (in blue). Only the period 1981–2010 is shown for visual clarity. b The zero-mean residual natural variability component, \({T}_{\text{nat}}\) and the anthropogenic trend, \({T}_{\text{anth}}\) (red and blue, respectively). c Spectra of the raw temperature series and the residual component, \({T}_{\text{nat}}\) (blue and red, respectively). The exponent, \(\beta\) was obtained from the linear regression of the spectrum averaged over equally spaced logarithmic bins. The reference dashed line with slope \(1+2H\) was also included. d Similar to c, but now considering the average spectra for all the 10,512 grid points

The second component, \({T}_{\text{anth}}\left(\mathbf{x},t\right)\), is a deterministic low-frequency response to anthropogenic forcings. It can be modelled as a response to the equivalent-CO2 (CO2eq) radiative forcing used in CMIP5 simulations (Meinshausen et al. 2011):

$$T_{{{\text{anth}}}} \left( {{\mathbf{x}},t} \right) = \lambda_{{2 \times {\text{CO}}_{{2}} {\text{eq}}}} \left( {\mathbf{x}} \right)\log_{2} \left[ {{{\rho_{{{\text{CO}}_{{2}} {\text{eq}}}} \left( t \right)} \mathord{\left/ {\vphantom {{\rho_{{{\text{CO}}_{{2}} {\text{eq}}}} \left( t \right)} {\rho_{{{\text{CO}}_{{2}} {\text{eq,pre}}}} }}} \right. \kern-\nulldelimiterspace} {\rho_{{{\text{CO}}_{{2}} {\text{eq,pre}}}} }}} \right],$$
(2)

where \({\rho }_{{\text{CO}}_{2}\text{eq}}\) is the observed globally-averaged equivalent-CO2 concentration with preindustrial value \({\rho }_{{\text{CO}}_{2}\text{eq},\text{pre}}=277\text{ ppm}\) and \({\lambda }_{2\times {\text{CO}}_{2}\text{eq}}\left(\mathbf{x}\right)\) is the transient climate sensitivity at position \(\mathbf{x}\) (that excludes delayed responses) related to the doubling of atmospheric equivalent-CO2 concentrations. For \({\rho }_{{\text{CO}}_{2}\text{eq}}\) we used the CMIP5 simulation values (Meinshausen et al. 2011). The definition of CO2eq includes not only greenhouse gases, but also aerosols, with their corresponding cooling effect. The sensitivity \({\lambda }_{2\times {\text{CO}}_{2}\text{eq}}\left(\mathbf{x}\right)\) is estimated from the linear regression of \(T\left(\mathbf{x},t\right)\) vs. \({\text{log}}_{2}\left[{\rho }_{{\text{CO}}_{2}\text{eq}}\left(t\right)/{\rho }_{{\text{CO}}_{2}\text{eq},\text{pre}}\right]\). This relationship ignores memory effects, but these are not too strong during periods where the forcing continues to increase. The zero-mean residual natural variability component, \({T}_{\text{nat}}\left(\mathbf{x},t\right)\), includes “internal” variability and the response of the system to other natural forcings (e.g.: volcanic and solar). Both components, \({T}_{\text{anth}}\) and \({T}_{\text{nat}}\), are shown in Fig. 1b (blue and red, respectively) for the same point as in Fig. 1a with coordinates 50.0°N, 2.5°E. At this location, it could be argued that the anthropogenic trend is insignificant compared to the amplitude of the natural component, but at some other locations it is more relevant. Besides, the cumulative effect of \({T}_{\text{anth}}\) for all the grid points is highly relevant for the globally averaged temperature (see Fig. 5 in DRAL).
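As an illustration of how this detrending can be carried out in practice, the sketch below (hypothetical array names; it assumes monthly series T_anom and rho_co2eq aligned in time) estimates the sensitivity by least squares and splits the anomalies into the anthropogenic and natural components; the intercept simply absorbs the anomaly baseline so that \({T}_{\text{nat}}\) has zero mean:

```python
import numpy as np

def split_anthropogenic(T_anom, rho_co2eq, rho_pre=277.0):
    """Estimate the transient climate sensitivity of Eq. (2) by regressing the
    (annual-cycle-removed) anomalies on log2(rho_CO2eq / rho_pre) and return
    (lambda_2xCO2eq, T_anth, T_nat)."""
    x = np.log2(np.asarray(rho_co2eq, dtype=float) / rho_pre)
    y = np.asarray(T_anom, dtype=float)
    lam, intercept = np.polyfit(x, y, 1)   # least-squares slope and intercept
    T_anth = lam * x + intercept           # deterministic anthropogenic component
    T_nat = y - T_anth                     # zero-mean natural variability residual
    return lam, T_anth, T_nat
```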

Alternatively, we could have used the CO2 concentration in Eq. (2), rather than CO2eq, as a surrogate for all anthropogenic effects, avoiding various uncertain radiative assumptions needed to estimate CO2eq (especially for aerosols). Nevertheless, from the point of view of detrending, the residuals, \({T}_{\text{nat}}\), would remain almost unchanged because of the nearly linear relation between the actual CO2 concentration and the estimated equivalent concentration (correlation coefficient > 0.993). There are more rigorous methods of detrending the original signal to obtain independent components with “stationary” residuals while preserving the length of the time series [e.g.: empirical mode decomposition (EMD) (Zeiler et al. 2010), ensemble empirical mode decomposition (EEMD) (Wu and Huang 2009), LOESS (Cleveland and Devlin 1988; Clarke and Richardson 2021)]. Nevertheless, the method used here gives a direct physical meaning to the residual, \({T}_{\text{nat}}\), and to the low-frequency trend, \({T}_{\text{anth}}\). It is also accurate enough for obtaining the detrended temperature anomalies, whose characterization, modelling and prediction are the focus of the following sections. A more accurate method, which takes into account the physics of the system by adding memory effects to the heat balance equation, was presented in Procyk et al. (2020).

2.2 Spectra

The effects of the detrending in the frequency domain can be observed by comparing the spectra of the raw temperature series and the residual component, \({T}_{\text{nat}}\). In Fig. 1c we show these two spectra in a log–log scale in blue and red, respectively, for the grid point with coordinates 50.0°N, 2.5°E. The spectrum of the detrended series was smoothed by taking averages with logarithmically spaced bins. Notice that the peak corresponding to the annual cycle was removed along with the signal \({T}_{\text{ac}}\), as well as the low-frequency response corresponding to \({T}_{\text{anth}}\). The frequency, \(\omega\), is given in units of cycles per 72 years (72 years is the length of the series).

After removing the peaks corresponding to the annual cycle (and harmonics) and the low-frequency response, the only relevant feature of the spectrum of the detrended anomalies, \(E\left(\omega \right)\), is its scale invariance (power–law behaviour):

$$E\left( \omega \right) \propto \omega^{ - \beta } .$$
(3)

The exponent, \(\beta =0.20\pm 0.11\), can be obtained from the linear regression of the spectrum averaged over equally spaced logarithmic bins (shown in red). The line corresponding to the best fit is shown in black in the figure. We also included a reference dashed line with slope \(1+2H\), where \(H\) is the fluctuation exponent (see next section).
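A minimal sketch of this estimate (in Python with NumPy; the periodogram normalization and the number of logarithmic bins are illustrative choices) is:

```python
import numpy as np

def spectral_exponent(T_nat, n_bins=20):
    """Estimate beta in E(omega) ~ omega**(-beta) from the periodogram of a
    detrended series, averaging over logarithmically spaced frequency bins
    before the log-log regression."""
    x = np.asarray(T_nat, dtype=float)
    E = np.abs(np.fft.rfft(x - x.mean()))**2          # periodogram (up to a constant)
    omega = np.arange(1, E.size)                      # skip the zero frequency
    E = E[1:]
    edges = np.logspace(0, np.log10(omega[-1] + 1), n_bins + 1)
    log_w, log_E = [], []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sel = (omega >= lo) & (omega < hi)
        if sel.any():                                 # average within each log bin
            log_w.append(np.log10(omega[sel].mean()))
            log_E.append(np.log10(E[sel].mean()))
    slope, _ = np.polyfit(log_w, log_E, 1)            # slope of log E vs log omega
    return -slope                                     # beta
```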

The scaling is even more noticeable in the less noisy spectrum shown in Fig. 1d, obtained by averaging the spectra of all the 10,512 grid points. Now the peaks corresponding to the periodic signal and the low-frequency contribution associated with anthropogenic effects are more clearly visible. The value of the exponent obtained in this case is \(\beta =0.42\pm 0.02\). The implications of this scale-invariance will be treated in more detail in the following sections.

2.3 Scaling

In DRAL, it was shown that, for the case of globally averaged monthly atmospheric surface temperature, the statistics of \({T}_{\text{nat}}\left(t\right)\) are characterized by one main symmetry: the power-law (scaling) behaviour of the average of the fluctuations, \(\Delta T\), as a function of the time scale, \(\Delta t\):

$$\left\langle {\left| {\Delta T\left( {\Delta t} \right)} \right|} \right\rangle \propto \Delta t^{H} ,$$
(4)

where \(H\) is the fluctuation exponent and the brackets \(\langle \bullet \rangle\) denote ensemble averaging. For \(-1<H<0\), Haar fluctuations, not differences, should be used (Lovejoy and Schertzer 2012a). Many examples of the low intermittency (“spikiness”) of the temperature fluctuations are given in Lovejoy and Schertzer (2013). Equivalently to Eq. (4), in the frequency domain the spectrum satisfies the previously mentioned equation: \(E\left(\omega \right)\propto {\omega }^{-\beta }\), with \(\beta =1+2H\) for monofractal processes. These statistical symmetries are not exclusive to the globally averaged temperature. There are many empirical results that show a “colored noise” scaling behaviour in local temperature spectra as well as in many other atmospheric variables (Brockwell and Davis 1991; Blender et al. 2006; Box et al. 2008; Lovejoy and Schertzer 2013; Varotsos et al. 2013; Christensen et al. 2015).
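To make the fluctuation statistics concrete, a simple implementation of the Haar fluctuation analysis used throughout this paper is sketched below (Python with NumPy; calibration constants and the exact window overlap are illustrative simplifications of Lovejoy and Schertzer 2012a):

```python
import numpy as np

def haar_fluctuation_exponent(T_nat, min_scale=2, max_scale=None):
    """Estimate the fluctuation exponent H from Haar fluctuations: at each
    (even) time scale dt, the fluctuation is the difference between the mean
    over the second half and the mean over the first half of the interval,
    and <|dT(dt)|> ~ dt**H."""
    T = np.asarray(T_nat, dtype=float)
    N = T.size
    if max_scale is None:
        max_scale = N // 4
    scales, mean_abs = [], []
    for dt in range(min_scale, max_scale + 1, 2):
        half = dt // 2
        flucts = [abs(T[t0 + half:t0 + dt].mean() - T[t0:t0 + half].mean())
                  for t0 in range(0, N - dt + 1, half)]   # windows shifted by dt/2
        scales.append(dt)
        mean_abs.append(np.mean(flucts))
    H, _ = np.polyfit(np.log(scales), np.log(mean_abs), 1)  # slope in log-log
    return H, np.array(scales), np.array(mean_abs)
```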

For globally averaged temperature at scales between 1 month and several decades, there is a single scaling regime with \(H<0\). If we analyze temperature time series from daily (or shorter) time scales, we find that, in general, there is a transition between two scaling regimes: from the weather, characterized by fluctuations increasing with the time scale (\(H>0\)), to the macroweather regime where fluctuations tend to cancel out as the time scale increases (\(H<0\)).

This transition in the statistical properties of the atmosphere at scales of the order of \({\tau }_{w}\approx 10\) days was theorized by Lovejoy and Schertzer (1986) as the lifetime of planetary-sized structures and was estimated from first principles using knowledge of the solar output and of the efficiency of conversion from solar to mechanical energy (Lovejoy and Schertzer 2010). A similar transition at \({\tau }_{w}\approx 1\) year was observed for the average surface temperature over the ocean (Lovejoy and Schertzer 2013).

The fluctuation exponents that characterize the weather and the macroweather regimes for surface air temperature (\({H}_{w}\) and \({H}_{mw}\), respectively), as well as the transition scale \({\tau }_{w}\), are functions of position with a strong dependence on latitude. In Fig. 2, we show a map of the exponents obtained from the Haar fluctuation analysis (Lovejoy and Schertzer 2012a) in the high-frequency scaling regime between 2 months and 2 years. In general, there is a consistent difference between the macroweather exponents of surface temperature over the oceans and over land, with \(-1/2<{H}_{mw}^{land}<{H}_{mw}^{ocean}<0\) (the ocean is more persistent and its fluctuations cancel out more slowly). Also, for any position over land and for most of the ocean, we find that \({\tau }_{w}<1\) month, so for surface temperature at monthly resolution only the macroweather regime is observed. Only for the tropical ocean do we find a well-defined transition, with \({\tau }_{w}\) as long as 2 years. Consequently, for this region, at time scales \(\Delta t<{\tau }_{w}\) the statistics of the fluctuations are those of the weather regime, with positive exponents (red in Fig. 2). This longer transition in the sea surface temperature (SST) corresponds to an analogous “ocean weather”–“ocean macroweather” transition (Lovejoy and Schertzer 2012b): it reflects the lifetimes of large-scale ocean gyres (and other structures), which live much longer than atmospheric structures.

Fig. 2
figure 2

Map of the fluctuation exponents obtained from the Haar fluctuation analysis (Lovejoy and Schertzer 2012a), in the high-frequency scaling regime between 2 months and 2 years

As an example, we show the Haar fluctuation analysis of the time series presented in Fig. 3a. We choose a point over land (time series in blue in Fig. 3a) with coordinates 50.0°N, 2.5°E (same grid point used before in Sect. 2.1) and a point in the tropical ocean (red in Fig. 3a) with coordinates 7.5°S, 30°W. In Fig. 3b, we show the average fluctuation as a function of the time scale before and after removing the anthropogenic trend for the point over land [red line with circles for the anomalies before removing the anthropogenic component (\({T}_{\text{anom}}={T}_{\text{nat}}+{T}_{\text{anth}}\)) and blue line with empty squares for the detrended anomalies (\({T}_{\text{nat}}\))]. The reference line with slope \({H}_{mw}=-0.39\pm 0.02\) was obtained from regression of the residuals’ fluctuations between 2 months and 18.5 years. The units for \(\Delta t\) and \(\Delta T\) are months and °C, respectively.

Fig. 3
figure 3

Examples of Haar fluctuation analysis for two points, one over land and one over ocean. a In blue, time series for a point over land with coordinates 50.0°N, 2.5°E (same grid point used before in Sect. 2.1); in red, for a point over ocean located at 7.5°S, 30°W and in black, the series of the temperature differences, \({\delta T}_{\text{nat}}\left(t\right)={T}_{\text{nat}}\left(t\right)-{T}_{\text{nat}}\left(t-1\right)\), for the same point over ocean (increments of the time series in red). b Average fluctuation as a function of the time scale before and after removing the anthropogenic trend for the point over land (red line with circles for the anomalies before removing the anthropogenic component and blue line with empty squares for the detrended anomalies). The reference lines with slopes \({H}_{mw}=-0.39\pm 0.02\) and \({H}_{mw}=1.0\pm 0.1\) were obtained from regression of the anomalies’ fluctuations in the respective macroweather and climate regimes. c Same as in b but now for the point over ocean. The three regimes (weather, macroweather and climate) are observed for this point. The corresponding transition scales and the respective exponents obtained from linear regression are also included in the graph. d Haar fluctuation analysis of the series of increments \({\delta T}_{\text{nat}}\left(t\right)\) for the point over ocean. The dashed line included as reference has slope \(H={H}_{w}-1=-0.82\), where \({H}_{w}\) is the one shown in c and the solid line has a slope \(H=-0.68\), which is the exponent obtained from the maximum likelihood method assuming that \({\delta T}_{\text{nat}}\) is a fractional Gaussian noise (fGn) process (see next section)

Notice that the anthropogenic warming breaks the scaling of the undetrended anomalies’ fluctuations at a time scale of 15–20 years (the fluctuations start to increase with scale at ~ 200 months). The fluctuation exponent for this low-frequency (climate) regime is \({H}_{c}=1.0\pm 0.1\), i.e. the fluctuations increase linearly with time, following the almost linear growth of the CO2 concentration in recent epochs. The residual natural variability, on the other hand, shows reasonably good scaling over the whole period analyzed (66 years). In analyses of temperature records from preindustrial multiproxies and from GCM preindustrial control runs (Lovejoy 2014), evidence was presented that the range of scaling with decreasing fluctuations (pre-industrial macroweather) may extend to more than 100 years.

As we mentioned before, for this point over land only one regime, with fluctuations decreasing with the time scale (the macroweather regime), is present in the natural variability. We can therefore conclude that, at this location, the weather–macroweather transition occurs at \({\tau }_{w}<1\) month (the maximum resolution of the analyzed data); this was confirmed using 6-h resolution data. In contrast, as we show in Fig. 3c, if we analyze the grid point in the tropical region over the ocean (time series in red in Fig. 3a), there is a clear transition at \({\tau }_{w}\sim 1.5\) years from the weather regime (with fluctuations increasing with scale) to the macroweather regime (with decreasing average fluctuations). A further transition occurs in the undetrended anomalies at \({\tau }_{c}\sim 18.5\) years to the climate regime, where fluctuations start to increase again with the time scale. As before, this transition in recent epochs is induced by anthropogenic effects. The actual transition in the natural variability, as obtained from preindustrial temperature records, apparently occurs at time scales longer than 100 years, which is consistent with the blue curves for \({T}_{\text{nat}}\) in Fig. 3b,c after the anthropogenic trend is removed. The fluctuation exponents for the three regimes, weather–macroweather–climate, have values \({H}_{w}=0.18\pm 0.03\), \({H}_{mw}=-0.18\pm 0.05\) and \({H}_{c}=1.0\pm 0.2\), respectively (shown in the graph), consistent with a smooth low-frequency behaviour.

A visual comparison between the blue and red curves in Fig. 3a shows a clear difference in the temperature behaviour at these two grid points. While over land consecutive values of the temperature tend to cancel out, over the ocean the temperature is more persistent and the anomalies change sign only after several time steps. This is confirmed by the Haar fluctuation analysis shown in Fig. 3b,c. This difference in statistical behaviour implies that, while a fractional Gaussian noise (fGn) model is a good fit for the extratropics, we cannot use it to describe the tropical region. Nevertheless, if we take first differences of the time series for the grid point over the tropical ocean, the new series \({\delta T}_{\text{nat}}\left(t\right)={T}_{\text{nat}}\left(t\right)-{T}_{\text{nat}}\left(t-1\right)\) (shown in black in Fig. 3a) has a statistical behaviour clearly more similar to that of the series over land, with consecutive fluctuations cancelling out. As we can see in Fig. 3d, the new series \({\delta T}_{\text{nat}}\left(t\right)\) has a scaling regime for small \(\Delta t\) with a negative fluctuation exponent similar to that of Fig. 3b. By taking first differences in the tropics, we are able to use an fGn process everywhere to predict the time series; we can then recover the original series at those locations by taking cumulative sums.
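This pre-processing can be sketched as follows (Python with NumPy; fgn_forecast is a hypothetical placeholder for the fGn predictor of Appendix 1.4, not an actual function of this paper):

```python
import numpy as np

def to_increments(T_nat):
    """First differences dT(t) = T(t) - T(t-1), used over the tropical ocean
    where 0 < H < 1, so that the differenced series has -1 < H < 0."""
    return np.diff(np.asarray(T_nat, dtype=float))

def from_increments(dT_forecast, T_last):
    """Recover temperature forecasts from forecast increments by cumulative
    summation, starting from the last observed value T_last."""
    return T_last + np.cumsum(dT_forecast)

# dT_hat = fgn_forecast(to_increments(T_nat), H, k)   # hypothetical fGn predictor
# T_hat = from_increments(dT_hat, T_nat[-1])
```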

There is still a change in the slope at \({\tau }_{w}\sim 1.5\) years, corresponding to the one in the original series shown in Fig. 3c. The dashed line included as reference has a slope \(H={H}_{w}-1=-0.82\). The series \({\delta T}_{\text{nat}}\), being the increments of the series \({T}_{\text{nat}}\), should have an exponent of the dominant high frequencies reduced by one. We also included in solid black, a reference line with slope \(H=-0.68\), which is the exponent obtained from the maximum likelihood method assuming that \({\delta T}_{\text{nat}}\) is an fGn process (see Sect. 2.4.2).

These examples, shown here for two positions, are representative of the behaviour of the natural temperature variability over the whole Earth. In fact, by taking first differences of the time series at those tropical-ocean locations that remain in the weather regime at monthly resolution, we reduce our analysis to a single case: self-similar time series with a negative exponent in the range \(-1<H<0\). This simplification emphasizes the role of the scaling symmetry which, although present in the Navier–Stokes equations at the core of conventional numerical prediction models and hence respected by them (Lovejoy and Schertzer 2013; Palmer 2019), is sometimes overlooked relative to the other conservation laws. In this work, we exploit this symmetry as the basis for the stochastic modelling and prediction of global temperature anomalies.

2.4 Stochastic modelling using fGn and fRn

2.4.1 Properties of fGn, fRn

Together with the scaling symmetry presented in the previous section, we also assume the Gaussianity of the natural temperature variability. This Gaussian hypothesis was verified in DRAL for globally averaged monthly temperature in the macroweather regime. Although the Gaussian assumption is commonly made, it is worth underlining that it is somewhat surprising that it is a reasonable model for macroweather time series. Recall that Gaussian statistics imply that macroweather in time has little or no intermittency (the series are mono-, not multifractal, and the transitions are not “spiky”). This contrasts with macroweather in space, which is highly intermittent, as well as with the highly intermittent, non-Gaussian, multifractal spatial and temporal statistics of the weather and climate regimes (Lovejoy 2018).

The scaling of the temperature fluctuations and spectrum implies that there are power-law correlations in the system and hence a large memory effect that can be exploited. In Lovejoy (2019) and Lovejoy et al. (2021), it was argued that the origin of this memory is the Earth’s hierarchical, scaling energy-storage mechanisms, whereby anomalies in energy fluxes, either external (e.g. anthropogenic) or internal, can be stored for long periods. It was argued that, to a good approximation, the temperature satisfies the Fractional Energy Balance Equation (FEBE), which has a high-frequency scaling storage term and a low-frequency energy balance term. When the FEBE is internally forced by a Gaussian white noise, the temperature response is the statistically stationary fractional Relaxation noise (fRn) process (Lovejoy 2019).

However, at time scales shorter than the relaxation time (of the order of a few years), the (scaling) storage term is dominant and, for exponents \(-1/2<H<0\), the temperature response is a fractional Gaussian noise (fGn) process. This was the approximation made in DRAL and it is empirically valid for all land areas and most of the oceans. The exceptions are some parts of the tropical ocean where \(0<H<1\) (Figs. 2 and 4a); we return to these below.

Fig. 4
figure 4

Estimates of the three parameters (\(H, {\sigma }_{T},{\lambda }_{2\times {\text{CO}}_{2}\text{eq}}\)) obtained for each grid point and statistics of the innovations, \(\gamma \left(t\right)\). a Maximum likelihood estimates of the temperature fluctuation exponent (compare with the estimates shown in Fig. 2). There is a discontinuity from negative to positive values of \(H\) as we approach the tropical ocean, corresponding to the change in model from fGn to fBm. b The standard deviation, \({\sigma }_{T}\), of the infinite ensemble fGn process. c Map of the transient climate sensitivity, defined in Eq. (2). The places marked with “*” indicate pixels where the null hypothesis, \({\lambda }_{2\times {\text{CO}}_{2}\text{eq}}=0\), cannot be rejected with more than 90% confidence. d Histograms of the fluctuation exponent and the standard deviation of the innovations (\({H}_{\gamma }\) and \({\sigma }_{\gamma }\), respectively) for the 10,512 grid points. From the histograms, we can conclude that the innovations are very close to white noise for the whole planet (\({H}_{\gamma }=-0.498\pm 0.003\) and \({\sigma }_{\gamma }=1.000\pm 0.002\))

The original idea of modelling the natural variability using an fGn process was presented in Lovejoy et al. (2015) as the ScaLIng Macroweather Model (SLIMM). In DRAL, StocSIPS was introduced as a general system that includes SLIMM as the core prediction model. StocSIPS also improves the mathematical and numerical techniques of SLIMM. It was applied to the prediction of globally averaged temperature series since 1880. The comparison of StocSIPS hindcasts with Canada’s operational long-range forecast system, the Canadian Seasonal to Interannual Prediction System (CanSIPS), showed that StocSIPS is just as accurate for 1-month forecasts, but significantly more accurate for longer lead times.

In this paper we extend the globally averaged version of StocSIPS for the prediction of a single temperature time series to the prediction of the full space–time temperature field. The basic theory for fGn processes, used here to model those places where \(-1/2<H<0\) (most of the planet), is summarized in Appendix 1. An fGn process is fully characterized by two parameters (assuming zero mean): the fluctuation exponent, \(H\), and the standard deviation, \({\sigma }_{T}\).

We mentioned that for most places in the tropical ocean, \(0<H<1\). While these may still be modelled by fRn processes, the high-frequency approximation to fRn is no longer an fGn process but rather a fractional Brownian motion (fBm) process, and we must use the correlation function for fRn given in Appendix 1.3, Eq. (23). For those regions with positive \(H\), the first differences of the temperature, \({\delta T}_{\text{nat}}\left(t\right)={T}_{\text{nat}}\left(t\right)-{T}_{\text{nat}}\left(t-1\right)\), have \(H\) values reduced by 1, so for \({\delta T}_{\text{nat}}\) we also have \(-1<H<0\). That is, either the natural temperature variability itself or its first differences can be modelled by an fGn process. In those places where \(H>0\) at high frequencies, this is equivalent to modelling the temperature itself with an fBm or fRn process. Of course, a true fBm would have only one scaling regime with a positive fluctuation exponent, instead of the bi-scaling behaviour shown for the detrended anomalies in Fig. 3c. Modelling those series as an fBm process is an approximation that works well at high frequencies but fails to reproduce the low-frequency behaviour.

2.4.2 Parameter estimates and model adequacy

With the proviso that in the tropical region we take first differences so that everything fits an fGn model, the actual temperature field for the globe (including the anthropogenic trend) can be modelled by estimating, for each grid point of the NCEP/NCAR Reanalysis 1 data, the three parameters (\(H\), \({\sigma }_{T}\), \({\lambda }_{2\times {\text{CO}}_{2}\text{eq}}\)). For the first two, we use the maximum likelihood method described in Appendix 1 of DRAL and, for the sensitivity, the regression described in Sect. 2.1. To verify the model adequacy, we use Eq. (22) to obtain the residual innovations, \(\gamma \left(t\right)\); then, using the maximum likelihood method, we obtain their standard deviation, \({\sigma }_{\gamma }\), and their fluctuation exponent, \({H}_{\gamma }\), which should be equal to 1 and \(-1/2\), respectively (white noise processes are particular cases of fGn with \(H=-1/2\)). The results are summarized in Fig. 4.
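For reference, a self-contained (and computationally naive) sketch of such a maximum likelihood fit is given below. It assumes the standard discrete fGn autocovariance, written in terms of the fluctuation exponent \(H\) used in this paper (conventional Hurst exponent \(H+1\)); the estimator actually used follows Appendix 1 of DRAL and may differ in its numerical details:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.linalg import cho_factor, cho_solve

def fgn_correlation(H, N):
    """Correlation matrix of a discrete fGn with fluctuation exponent
    -1 < H < 0 (conventional Hurst exponent H' = H + 1)."""
    Hp = H + 1.0
    k = np.arange(N, dtype=float)
    rho = 0.5 * (np.abs(k + 1)**(2*Hp) - 2*np.abs(k)**(2*Hp) + np.abs(k - 1)**(2*Hp))
    i, j = np.indices((N, N))
    return rho[np.abs(i - j)]

def fgn_mle(T_nat):
    """Maximum likelihood estimates of (H, sigma_T) for a zero-mean fGn,
    obtained by maximizing the Gaussian log-likelihood profiled over the
    variance."""
    x = np.asarray(T_nat, dtype=float)
    x = x - x.mean()                       # the detrended anomalies are ~zero mean
    N = x.size

    def neg_profile_loglik(H):
        R = fgn_correlation(H, N)
        c = cho_factor(R)                  # Cholesky factor of the correlation matrix
        logdet = 2.0 * np.sum(np.log(np.diag(c[0])))
        sigma2 = x @ cho_solve(c, x) / N   # profile ML estimate of the variance
        return 0.5 * (N * np.log(sigma2) + logdet)

    res = minimize_scalar(neg_profile_loglik, bounds=(-0.95, -0.05), method='bounded')
    H_hat = res.x
    R = fgn_correlation(H_hat, N)
    c = cho_factor(R)
    sigma_T = np.sqrt(x @ cho_solve(c, x) / N)
    return H_hat, sigma_T
```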

A map of the maximum likelihood estimates of the temperature fluctuation exponent is shown in Fig. 4a. These values are more accurate and give a better fit of our model than the high-frequency Haar estimates shown in Fig. 2. Notice that for most of the globe and all of the land, the values are in the range \(-1/2<H<0\), which is characteristic of long-range memory fGn processes with nonsummable correlation functions, i.e. the sum over \(\Delta t\) of the series with elements given by Eq. (17) diverges for this range of \(H\). There is a discontinuity from negative to positive values of \(H\) as we approach the tropical ocean, corresponding to the change in model from fGn to fBm (or, equivalently, from the description as an fGn of the natural temperature variability, \({T}_{\text{nat}}\), to the description of the temperature differences, \({\delta T}_{\text{nat}}\)). In most of the tropical ocean (red regions in the map), the natural temperature variability has fluctuation exponents in the range \(0<H<1/2\), whose fBm approximation has “anti-persistent” increments (consecutive increments are negatively correlated). Only in the eastern equatorial Pacific (yellow region in the map), do we obtain fluctuation exponents in the range \(1/2<H<1\), whose fBm approximation has persistent (positively correlated) increments. It is significant that it is precisely this more predictable region that is associated with the ENSO phenomenon (Trenberth 1997), the strongest interannual signal of climate variability on Earth.

In Fig. 4b we show the values of the parameter \({\sigma }_{T}\). Although this is the standard deviation of the infinite ensemble fGn process, for a given finite realization it does not coincide with the usual estimate (SDT) based on the temporal average:

$$SD_{T}^{2} = \frac{1}{N}\sum\limits_{t = 1}^{N} {\left[ {T_{{{\text{nat}}}} \left( t \right) - \overline{T}_{N} } \right]^{2} } ,$$
(5)

where \({\stackrel{-}{T}}_{N}={\sum }_{t=1}^{N}{T}_{\text{nat}}\left(t\right)/N\) (the over-bar notation is used to denote averaging in time). The biased estimate \({SD}_{T}\) ignores correlations, which are however accounted for in the maximum likelihood estimate of \({\sigma }_{T}\). The relation between the two values for fGn processes depends on the length of the time series and on the fluctuation exponent, \(H\), and is given by:

$$SD_{T}^{2} = \sigma_{T}^{2} \left( {1 - N^{2H} } \right)$$
(6)

(see Sect. 3.3 and Appendix 1 of DRAL). Notice that there is also a discontinuity in the map of \({\sigma }_{T}\), for the same reasons explained previously. In general, the amplitude of the fluctuations is larger over land than over the ocean; the surface temperature over the ocean is less variable because the ocean has a higher thermal inertia than land.

A map of the transient climate sensitivity, defined in Eq. (2), is shown in Fig. 4c. The places marked with “*” indicate grid boxes where the null hypothesis, \({\lambda }_{2\times {\text{CO}}_{2}\text{eq}}=0\), cannot be rejected with more than 90% confidence. Notice that these values depend on the reference dataset; in our case we used the NCEP/NCAR Reanalysis 1, which only has data since 1948. More precise estimates of the climate sensitivity were obtained by Hébert and Lovejoy (2018) using five observational datasets extending back to 1880. In this paper, we are not aiming at an accurate study of the climate sensitivity; the values of \({\lambda }_{2\times {\text{CO}}_{2}\text{eq}}\) reported here should simply be regarded as parameters used to remove the anthropogenic trend from the temperature time series.

Finally, in Fig. 4d, we show histograms of the fluctuation exponent and of the standard deviation of the innovations (\({H}_{\gamma }\) and \({\sigma }_{\gamma }\), respectively) for the 10,512 grid points. From the histograms, we conclude that the innovations are very close to white noise everywhere on the planet (\({H}_{\gamma }=-0.498\pm 0.003\) and \({\sigma }_{\gamma }=1.000\pm 0.002\)). So, to a high degree of accuracy, all the innovation series can be considered \(\text{NID}\left(\text{0,1}\right)\) (Normally and Independently Distributed with mean 0 and variance 1), and we can conclude that the fGn model is a good fit to the natural temperature variability (or to its increments in the red and yellow regions of the map in Fig. 4a).

3 Results

3.1 Natural variability forecast

3.1.1 Model validation through hindcast

In the previous section, we validated the fGn model as a good fit to the natural temperature variability (or to its increments) by checking the whiteness of the residual innovations. The goal of this section is to further validate the model by using the theory presented in Appendix 1.4 to hindcast only the natural variability—not the anthropogenic signal or the annual cycle—and seeing how well it performs. We test the assumptions made in the model by comparing the theoretically expected skill scores (expected values if the model were perfect) with the actual scores obtained from hindcasts. All the verification metrics used in this paper are detailed in Appendix 2.

Series of hindcasts at monthly resolution were produced for forecast horizons from 1 to 12 months, in the period of verification (POV) from December 1950 to November 2019 (the verification starts in December in order to have the same number of conventional seasons: DJF, MAM, JJA, SON). In this 69-year verification period, each month was independently predicted using the information available \(m\) months before. For each horizon, \(k\), we used a memory of \(m=20\) months. For example, to predict the average temperature for December 1950 with \(k=1\) month, we used the previous 21 months, up to and including November 1950, and the same was done for every verification date up to November 2019 and for all horizons up to \(k=12\) months. The dependence of many scores on the horizon [e.g. the root mean square error (\(\text{RMSE}\))] is obtained from the difference between the hindcast series at fixed \(k\) and the corresponding series of observations.

It is important to point out that the predictor \({\widehat{T}}_{\text{nat}}\left(t+k\right)\) (see Eq. 24) only depends on the previous \(m+1\) months, from \({T}_{\text{nat}}\left(t-m\right)\) to \({T}_{\text{nat}}\left(t\right)\), weighted by coefficients that only depend on the fluctuation exponent \(H\) (see Fig. 4a). The estimates of \(H\) are quite robust: only small variations were obtained for different training periods, as long as the length of the training period exceeds one third of the full length of the time series. Also, only small changes in the skill were found for small variations in \(H\). Given this robustness of the estimates of the fluctuation exponent, we can use almost the entire observational period for verification, leaving only a few months before the first initialization date to be used as memory. In all cases, the observational and forecast anomalies used for verification were calculated in leave-one-out cross-validation mode.
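The structure of this hindcast procedure can be sketched as follows (Python with NumPy; weights_fn stands for the m + 1 predictor coefficients of Eqs. (24)–(25), which are not reproduced here):

```python
import numpy as np

def hindcast_series(T_nat, H, k, weights_fn, m=20):
    """Produce a series of k-month-ahead hindcasts, each using only the
    m + 1 most recent values, as in the verification setup described in
    the text.  weights_fn(H, m, k) must return the m + 1 predictor
    coefficients (a placeholder for Eqs. 24-25 of Appendix 1.4)."""
    T = np.asarray(T_nat, dtype=float)
    w = weights_fn(H, m, k)                   # coefficients depend only on H, m and k
    forecasts, targets = [], []
    for t in range(m, T.size - k):
        forecasts.append(w @ T[t - m:t + 1])  # weighted sum of the last m + 1 months
        targets.append(T[t + k])
    return np.array(forecasts), np.array(targets)
```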

3.1.1.1 Root Mean Square Error (RMSE)

The infinite ensemble expectation of the \(\text{RMSE}\) is given in Appendix 1.4 (Eq. 27). This analytical expression is only a function of the model parameters and does not include any observational data. It is the theoretical \(\text{RMSE}\) value for a perfect model. To confirm the validity of the theoretical framework for the prediction of the natural variability component, we compare these expected values for each grid point with the actual verification \(\text{RMSE}\) obtained from hindcasts in the POV from December 1950 to November 2019. The all-month verification score for horizon \(k\) is obtained using Eqs. (32) and (33) with \(N=828\) months and \({T}_{\text{nat}}\left(t+k\right)\) and \({\widehat{T}}_{\text{nat}}\left(t+k\right)\) being the zero mean detrended observational and predicted anomalies, respectively.

The comparison between the theoretical and the actual (obtained from hindcasts) normalized root mean square error (\(\text{NRMSE}\)) is shown in Fig. 5 for horizon \(k=1\) month. The \(\text{NRMSE}\) is the \(\text{RMSE}\) normalized by the observed standard deviation (Eq. (5) for the natural variability). The NRMSE may vary from zero to infinity, with lower \(\text{NRMSE}\) values indicating more skillful forecasts. \(\text{NRMSE}\) values greater than 1 indicate that forecasts are less skillful than the climatological average value of the series. As we pointed out, for a fixed k, the theoretical \(\text{RMSE}\) only depends on the parameters \({\sigma }_{T}\) and \(H\). In general, there is very good agreement between theory and verification results. The maximum difference between the two maps in Fig. 5 is lower than 0.07. The forecast skill is higher over ocean than over land and takes the highest values over the tropical ocean, which corresponds to the spatial distribution of \(H\) values shown in Figs. 2 and 4a.
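For reference, the hindcast \(\text{NRMSE}\) used in these maps is computed as in the sketch below (Python with NumPy; illustrative array names):

```python
import numpy as np

def nrmse(forecasts, observations):
    """Normalized RMSE: the RMSE of the forecasts divided by the standard
    deviation of the verifying observations (Eq. 5).  Values below 1 indicate
    more skill than the climatological forecast."""
    f = np.asarray(forecasts, dtype=float)
    o = np.asarray(observations, dtype=float)
    rmse = np.sqrt(np.mean((f - o) ** 2))
    return rmse / o.std()
```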

Fig. 5
figure 5

Theoretical and hindcasts \(\text{NRMSE}\) for \(k=1\) month. The corresponding \(\text{RMSE}\)s were obtained using Eqs. (27) and (33), respectively, and the normalization standard deviation from Eq. (5) for the natural variability

The maps in Fig. 5 were obtained for \(k=1\) month, but similar maps can be obtained for all forecast horizons from 1 to 12 months. The results of the comparison are summarized in the scatter plots shown in Fig. 6. The graphs include the 10,512 grid points, showing the verification \(\text{RMSE}\) obtained from hindcasts vs. the expected theoretical \({\text{RMSE}}_{\text{nat}}^{\text{theory}}\) predicted by Eq. (27) for each horizon. As expected, the agreement between the theoretically expected scores and the hindcast results decreases as the horizon increases, but it remains quite accurate in all cases, with a correlation coefficient larger than 0.998. For the regions where \(H>0\), the fBm fit is less accurate; recall, however, that in those places the actual statistics of the fluctuations are bi-scaling, while the fBm model assumes a perfectly scaling process. The accuracy of the theory decreases as the horizon approaches the transition time, \({\tau }_{w}\).

Fig. 6
figure 6

Scatter plots for each horizon including the 10,512 grid points, showing the verification RMSE obtained from hindcasts vs. the expected theoretical \({\text{RMSE}}_{{{\text{nat}}}}^{{{\text{theory}}}}\) predicted by Eq. (27). The graphs were displaced vertically by 2 °C (plus a horizontal displacement of 8 °C for k ≥ 7 months) for visual clarity. The black line at 45° is a reference indicating perfect agreement between theory and verification. The blue points represent locations where H < 0 and the natural variability is modeled as an fGn process and the red points are for places where H > 0 and we use the fBm model

3.1.1.2 Mean Square Skill Score (MSSS)

Closely related to the \(\text{RMSE}\), the MSSS is another commonly used metric (see Eq. 35). The guidelines of the World Meteorological Organization (WMO) Standard Verification System for Long-Range Forecasts (LRFs) (WMO 2010a) suggest the MSSS as a metric for deterministic forecasts (based on the ensemble mean). For leave-one-out cross-validated data in the POV (WMO 2010a), the mean square error (\(\text{MSE}\)) of the reference climatology forecasts (including the deterministic anthropogenic trend forecast) is:

$${\text{MSE}}_{{\text{C}}} = \left( {\frac{N}{N - 1}} \right)^{ \, 2} SD_{T}^{2}$$
(7)

(see Eq. 34), where \({{SD}_{T}}^{2}\) is the variance of the detrended anomaly series (natural variability component). The MSSS for horizon \(k\) for the natural variability forecast is:

$${\text{MSSS}}_{{{\text{nat}}}} \left( k \right) = 1 - \frac{{{\text{MSE}}_{{{\text{nat}}}} \left( k \right)}}{{\left( {\frac{N}{N - 1}} \right)^{ \, 2} SD_{T}^{2} }},$$
(8)

where \({\text{MSE}}_{\text{nat}}\) is obtained using Eq. (32) with \(N=828\) months and \({T}_{\text{nat}}\left(t+k\right)\) and \({\widehat{T}}_{\text{nat}}\left(t+k\right)\) being the zero mean detrended observational and predicted anomalies, respectively.

One consequence of the memory effects in the natural variability is the increase of \({{SD}_{T}}^{2}\) with the length of the verification period, given by Eq. (6). This implies that some metrics, such as the MSSS or the \(\text{NRMSE}\), will also depend on the duration of the verification period: the longer the verification period, the higher the value of the MSSS (and the lower the \(\text{NRMSE}\)), even for a fixed prediction system (with fixed \(\text{RMSE}\)). Comparisons between skill scores of different models should therefore always be made for the same POV (or at least the same length of POV). As the number of months used for verification increases, \({{SD}_{T}}^{2}\to {{\sigma }_{T}}^{2}\) and the MSSS approaches its asymptotic value (determined by \(H\)). This effect is small for most values of \(H\), but it is significant if the verification period is too short or if \(H\) is close to zero (e.g. the bias \({{SD}_{T}}^{2}/{{\sigma }_{T}}^{2}\approx 0.6\) for \(N=100\) months and \(H=-0.1\)). See Fig. 9 in DRAL for an example for monthly globally averaged temperature.
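The quoted bias factor follows directly from Eq. (6); for example:

```python
# Bias factor SD_T**2 / sigma_T**2 from Eq. (6) for the example quoted above
N, H = 100, -0.1
print(round(1 - N ** (2 * H), 2))   # -> 0.6: SD_T**2 underestimates sigma_T**2 by ~40%
```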

3.1.1.3 Temporal correlation coefficient (TCC)

The \(\text{TCC}\) is another commonly used verification score for deterministic forecasts (see Eq. 36). For the natural variability forecast, the \(\text{TCC}\) for horizon \(k\) is:

$${\text{TCC}}_{{{\text{nat}}}} \left( k \right) = \frac{{\overline{{T_{{{\text{nat}}}} \left( {t + k} \right)\hat{T}_{{{\text{nat}}}} \left( {t + k} \right)}} }}{{SD_{T} \sqrt {\overline{{\hat{T}_{{{\text{nat}}}} \left( t \right)^{2} }} } }},$$
(9)

where the overbars indicate temporal average for a constant \(k\).

For the natural variability forecast, the autoregressive coefficients in our predictor were obtained as analytical functions of only the fluctuation exponent, \(H\) (see Eqs. 24 and 25). As we showed in Appendix 2.3, if our model is adequate for describing the natural temperature variability, then the following relationship between the verification \({\text{TCC}}_{\text{nat}}\) and \({\text{MSSS}}_{\text{nat}}\) should be satisfied for \(k=1\) month:

$${\text{TCC}}_{{{\text{nat}}}} \left( 1 \right) \approx \sqrt {{\text{MSSS}}_{{{\text{nat}}}} \left( 1 \right)} .$$
(10)

This relationship does not hold for all horizons in the tropical region, due to the use there of the fBm rather than the fGn model.

In Fig. 7 we show maps of the \({\text{TCC}}_{\text{nat}}\) and of the absolute difference \(\left|{\text{TCC}}_{\text{nat}}-\sqrt{{\text{MSSS}}_{\text{nat}}}\right|\) obtained from hindcasts for \(k=1\) month. The colour scale in (b) was rescaled 100 times with respect to (a) so that the differences could be perceptible; they are negligible compared to the values in (a). The differences in Fig. 7b are almost always lower than 0.01 (mean value of 0.001), which strongly corroborates the adequacy of the fGn model for describing the natural variability.

Fig. 7
figure 7

Maps of \({\text{TCC}}_{\text{nat}}\) and the absolute difference \(\left|{\text{TCC}}_{\text{nat}}-\sqrt{{\text{MSSS}}_{\text{nat}}}\right|\) obtained from hindcasts for \(k=1\) month. The colour scale in b was rescaled 100 times with respect to a so the differences could be perceptible

3.1.2 Probabilistic scores and reliability

All the skill scores discussed above are recommended by the WMO for assessing deterministic long-range forecasts (WMO 2010b). These forecasts are deterministic in the sense that only the ensemble mean is considered, disregarding the ensemble variance or, more accurately, the prediction of the probability distribution. In this study we focus only on deterministic predictions (deterministic in the sense just mentioned; recall that we use a stochastic model) because, given a Gaussian approximation of the probability distribution function, the skill of probabilistic forecasts depends mainly upon the skill of the ensemble mean predictions and much less upon predictions of the ensemble variances (Kryjov et al. 2006). In fact, in DRAL it was shown that, assuming a Gaussian distribution for the errors, the Continuous Ranked Probability Score (\(\text{CRPS}\)) (Hersbach 2000; Gneiting et al. 2005), which is a commonly used metric for probabilistic forecasts, is related to the \(\text{RMSE}\) by:

$${\text{CRPS}}\left( k \right) = \frac{{{\text{RMSE}}\left( k \right)}}{\sqrt \pi }\left[ {\sqrt {2\left( {1 + {\text{ESS}}} \right)} - \sqrt {{\text{ESS}}} } \right],$$
(11)

where:

$${\text{ESS}} = \frac{{\overline{{\sigma_{{{\text{ensemble}}}}^{2} }} }}{{{\text{MSE}}}}$$
(12)

is the ensemble spread score, defined as the ratio between the temporal mean of the intra-ensemble variance, \(\overline{{\sigma_{{{\text{ensemble}}}}^{2} }}\), and the mean square error between the ensemble mean and the observations (Palmer et al. 2006; Keller and Hense 2011; Pasternack et al. 2018). The \(\text{ESS}\) is a commonly used metric to evaluate the reliability of the probabilistic forecast of an ensemble model.
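Equations (11) and (12) can be transcribed directly (sketch in Python with NumPy; the array names are illustrative):

```python
import numpy as np

def ensemble_spread_score(sigma_ensemble, errors):
    """ESS (Eq. 12): temporal mean of the intra-ensemble variance divided by
    the mean square error of the ensemble mean."""
    return np.mean(np.asarray(sigma_ensemble) ** 2) / np.mean(np.asarray(errors) ** 2)

def crps_gaussian(rmse, ess):
    """CRPS of a Gaussian forecast written in terms of the RMSE and the
    ensemble spread score (Eq. 11); for ESS = 1 it reduces to the minimum
    value RMSE / sqrt(pi)."""
    return rmse / np.sqrt(np.pi) * (np.sqrt(2.0 * (1.0 + ess)) - np.sqrt(ess))

# Sanity check: perfect reliability (ESS = 1) gives CRPS = RMSE / sqrt(pi)
assert np.isclose(crps_gaussian(1.0, 1.0), 1.0 / np.sqrt(np.pi))
```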

For the case of StocSIPS, which by definition is a Gaussian model with ensemble spread \({\sigma }_{\text{ensemble}}={\text{RMSE}}_{\text{nat}}^{\text{theory}}\) (given by Eq. (27)), the agreement between \({\text{RMSE}}_{\text{nat}}^{\text{theory}}\) and \({\text{RMSE}}_{\text{nat}}\) (summarized in Fig. 6 for all horizons) implies that \(\text{ESS}\approx 1\) almost everywhere.

The graphs shown in Fig. 6 are analogous to spread-error scatterplots (Leutbecher and Palmer 2008). In our case, each point represents the ensemble spread and the temporal-average \(\text{RMSE}\) for each pixel, instead of the spatially averaged values shown in Fig. 4 of Leutbecher and Palmer (2008). We could group and average the values in equally populated bins to produce more conventional spread-error plots, but as they all fall near the reference diagonal, the conclusions would remain the same. Other measures used to assess reliability [like the error-spread score (Christensen et al. 2015)] depend on the third or higher order moments of the forecast probability distribution. Since the StocSIPS forecast is Gaussian by definition, the \(\text{ESS}\) used here (Eq. 12) gives enough information, assuming the near Gaussianity of the observational probability distribution.

In Fig. 8 we show maps of the \(\text{ESS}\) of StocSIPS for horizons from 1 to 4 months. Notice that, from Eq. (27), \({\sigma }_{\text{ensemble}}={\text{RMSE}}_{\text{nat}}^{\text{theory}}\) is a function of the forecast horizon and of the location, following the spatial distribution of the model parameters \({\sigma }_{T}\) and \(H\), but for all pixels the \(\text{ESS}\) is very close to 1, except over the tropical ocean where the system tends to be “overdispersive” (\(\text{ESS}>1\)). The average values for the globe, with one standard deviation, are shown in brackets in each map label. They increase monotonically from \(0.96\pm 0.05\) for \(k=1\) month, \(0.98\pm 0.03\) for \(k=2\) months, \(0.98\pm 0.04\) for \(k=3\) months and \(1.00\pm 0.06\) for \(k=4\) months, up to \(1.09\pm 0.21\) for \(k=12\) months (only the first four values are included in the maps). From Eq. (11), it can be shown that for a system with perfect reliability, where \(\text{ESS}=1\), the \(\text{CRPS}\) takes its minimum value \({\text{CRPS}}_{\text{min}}=\text{RMSE}/\sqrt{\pi }\). In any other case, when the system is “overconfident” (\(\text{ESS}<1\)) or “overdispersive” (\(\text{ESS}>1\)), \(\text{CRPS}>\text{RMSE}/\sqrt{\pi }\). In conclusion, StocSIPS is a nearly perfectly reliable system (except over the tropical ocean), without any need for recalibration of the forecast probability distribution.

Fig. 8
figure 8

Maps of \(\text{ESS}\) of StocSIPS for horizons \(k\) from 1 to 4 months (ad, respectively). The values of the \(\text{ESS}\) are very close to 1, with the exception of the tropical ocean where it tends to be “overdispersive” (\(\text{ESS}>\)1). The average values for the globe with one standard deviation are shown in brackets in the map labels

3.2 Hindcast verification

3.2.1 Monthly and 3-month average predictions

The results presented in Sect. 3.1 confirm the validity of the stochastic model for forecasting the natural temperature variability. In this section, we show the verification scores for the forecast of the raw (undetrended) anomalies, including the forecast of the CO2eq deterministic trend. All the scores were computed following the definitions given in Appendix 2.

Given the smooth variation of the CO2eq concentration at monthly scales, we can use simple extrapolation in Eq. (2) to obtain the predictor \({\widehat{T}}_{\text{anth}}\left(t+k\right)\) from the knowledge of the CO2eq concentration path up to time \(t\). As the function \({T}_{\text{anth}}\left(t\right)\) is almost linear in a \(k\)-vicinity of any \(t\), the error of projecting the anthropogenic component is negligible compared to the error of the natural variability. In fact, as we assume the same global CO2eq forcing affecting all locations, the error of predicting the anthropogenic trend for a given \(k\) is proportional to the sensitivity map shown in Fig. 4c. It was found that this error is lower than 2% of the \(\text{RMSE}\) of the natural variability for all locations and for all horizons. In any case, the projection of the trend was still included in the following verification results.
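A linear extrapolation of this kind can be written as follows (sketch; the 24-month fitting window is an illustrative choice, not a value taken from the text):

```python
import numpy as np

def extrapolate_anth(T_anth_past, k, window=24):
    """Project T_anth k months ahead by linear extrapolation of the last
    `window` months, exploiting its near-linearity over short horizons."""
    y = np.asarray(T_anth_past, dtype=float)[-window:]
    t = np.arange(y.size)
    slope, intercept = np.polyfit(t, y, 1)
    return intercept + slope * (y.size - 1 + k)
```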

3.2.1.1 Normalized root mean square error (NRMSE)

Figure 9 shows maps of the \(\text{NRMSE}\) for horizons \(k=1\), 2 and 3 months (panels a–c, respectively) and for the seasonal forecast (including all seasons, average for k = 1–3 months) in panel (d). The values in brackets in the figure labels are the \(\text{NRMSE}\) area-averaged globally over the grid points (see Eq. 39). In general, the skill of the forecasts is larger over ocean than over land, with the lowest values of \(\text{NRMSE}\) attained over the tropical ocean. This corresponds to the distribution of \(H\) shown in Figs. 2 and 4a.

Fig. 9

Normalized root mean square error \(\text{NRMSE}\) for: (a) \(k=1\) month, (b) \(k=2\) months, (c) \(k=3\) months and (d) for the all-seasons mean (average for k = 1–3 months). The values in brackets in the figure labels represent the areal mean of global \(\text{NRMSE}\)

Since small \(\text{NRMSE}\) implies large skill, according to the globally averaged \(\text{NRMSE}\) the seasonal skill is larger than that of any of the first three individual monthly forecasts. This is possible because, although the horizon is further in the future, the seasonal forecast is for a longer (3-month) average. For scaling processes, the two effects exactly compensate. For the prediction of the natural variability component using fGn, the skill in predicting the next month using monthly averaged data is the same as the skill in predicting the next season using 3-month averaged data. This is reflected in Eq. (27), where \(k\) is in units of \(\tau\), the resolution (smallest sampling temporal scale) of the data. The similarity between the average values in the captions of panels (a) and (d) of Figs. 9, 10 and 11 confirms this consequence of the scaling.
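This compensation can be traced directly to the self-similarity of the fGn model; a brief sketch of the argument (in the fluctuation-exponent convention \(-1/2<H<0\) used here) is as follows. Averaging the process over blocks of \(m\) steps (e.g. \(m=3\) months) merely rescales it,

$${T}^{(m)}\left(i\right)=\frac{1}{m}\sum_{j=\left(i-1\right)m+1}^{im}T\left(j\right)\stackrel{d}{=}{m}^{H}T\left(i\right),$$

so all the normalized correlations, and hence the skill scores of the conditional expectation predictor at a given horizon \(k\) (measured in units of the resolution), are left unchanged by the aggregation.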

Fig. 10

Mean square skill score (MSSS) for: (a) \(k=1\) month, (b) \(k=2\) months, (c) \(k=3\) months and (d) for the all-seasons mean (average for k = 1–3 months). The values in brackets in the figure labels represent the areal mean of global MSSS

Fig. 11

Temporal correlation coefficient (\(\text{TCC}\)) for: (a) \(k=1\) month, (b) \(k=2\) months, (c) \(k=3\) months and (d) for the all-seasons mean (average for k = 1–3 months). The values in brackets in the figure labels represent the areal mean of global \(\text{TCC}\)

The values in Fig. 9a for the forecast of the raw anomalies are lower than those shown in Fig. 5b for the natural variability because, while the \(\text{RMSE}\)s of both are almost the same (we can neglect the error in projecting the anthropogenic trend), the normalization factor (the standard deviation of the respective anomalies) is larger for the undetrended anomalies.

3.2.1.2 Mean Square Skill Score (MSSS)

To compute the MSSS for the raw anomalies, the \(\text{MSE}\) of the reference climatology forecast (a forecast produced using only the annual cycle signal, without removing the anthropogenic variation) is in this case:

$${\text{MSE}}_{\text{C}}={\left(\frac{N}{N-1}\right)}^{2}{{SD}_{\text{anom}}}^{2},$$
(13)

where \({{SD}_{\text{anom}}}^{2}\) is the variance of the anomaly series without removing the anthropogenic component:

$${{SD}_{\text{anom}}}^{2}=\overline{{{T}_{\text{anom}}}^{2}}=\overline{{\left({T}_{\text{anth}}+{T}_{\text{nat}}\right)}^{2}}=\overline{{{T}_{\text{anth}}}^{2}}+{{SD}_{T}}^{2}$$
(14)

(assuming that the natural and anthropogenic variabilities are independent).

Because \({{SD}_{\text{anom}}}^{2}>{{SD}_{T}}^{2}\) and the \(\text{MSE}\)s of the forecasts of the raw and the detrended anomalies are almost equal, it follows from Eq. (8) that the MSSS for the undetrended series is larger than that for the natural variability.

Maps of \(\text{MSSS}\), corresponding to those shown in Fig. 9, are shown in Fig. 10. The difference in skill between ocean and land is more evident in these maps. In many places over land, the \(\text{MSSS}\) is close to zero, meaning that most of the skill comes from the projection of the anthropogenic trend. The global averages shown in brackets in the map labels are computed following the guidelines of the WMO (2010a). Note that the maps and the average values shown in Figs. 9 and 10 are related as \(\text{MSSS}\approx 1-{\text{NRMSE}}^{2}\) if the reference forecast for the \(\text{MSSS}\) is the climatological annual cycle.
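As a concrete check of this relation, the following sketch uses purely illustrative synthetic series (not hindcast data; the forecast model and the value of \(N\) are hypothetical) to compare \(\text{MSSS}\) computed with the cross-validated climatology of Eq. (13) against \(1-{\text{NRMSE}}^{2}\):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical anomalies and forecasts for one grid point (purely illustrative).
obs = rng.normal(0.0, 1.0, size=324)                       # e.g. 27 years of months
fcst = 0.6 * obs + rng.normal(0.0, 0.8, size=obs.size)     # imperfect forecast

mse = np.mean((fcst - obs) ** 2)
sd2 = np.var(obs)
n = 27                                  # hypothetical N of Eq. (13)

nrmse = np.sqrt(mse / sd2)
msss = 1.0 - mse / ((n / (n - 1.0)) ** 2 * sd2)    # cross-validated climatology, Eq. (13)

# MSSS and 1 - NRMSE^2 agree up to the (N/(N-1))^2 cross-validation factor.
print(f"MSSS = {msss:.3f},  1 - NRMSE^2 = {1.0 - nrmse**2:.3f}")
```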

3.2.1.3 Temporal correlation coefficient (TCC)

As with the \(\text{MSSS}\), if the \(\text{TCC}\) is obtained for the undetrended anomalies (with only the annual cycle removed, but not the anthropogenic trend), then, because of the extra correlation associated with the trend, higher values are often obtained than for the \(\text{TCC}\) of the natural variability. For most of the long-term forecasts reported in the literature, only the annual cycle is removed: the increasing trend related to anthropogenic warming is kept in the anomalies used for verification, resulting in artificially boosted skill scores.

In Fig. 11, we show maps of the \(\text{TCC}\) for the prediction of the raw anomalies. The number in brackets in the caption of each plot indicates the global area average of the grid-point correlation coefficients. The area average was computed by taking the Fisher Z-transform of the correlations following Eq. (42) (Fisher 1915; WMO 2010b). The StocSIPS predictions over the ocean are highly correlated with the observations, and the highest correlations are in the tropical regions. Over land, although the skill is poorer, the correlations are still significant for the forecast of the first 3 months. The \(\text{TCC}\) of the prediction is positive almost everywhere and, compared to the \(\text{NRMSE}\) or the \(\text{MSSS}\), it shows significantly larger skill. This “extra” skill shown in the correlations for the raw anomalies comes from the presence of the anthropogenic signal.
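For reference, the sketch below gives a minimal version of such an area average (our own cos-latitude weighted Fisher-Z averaging; the exact details of Eq. (42) may differ, and the `tcc` field and grid are hypothetical):

```python
import numpy as np

def area_avg_tcc(tcc, lat_deg):
    """Cos-latitude weighted global average of grid-point correlations, taken in
    Fisher-Z space and transformed back."""
    w = np.cos(np.deg2rad(lat_deg))[:, None] * np.ones_like(tcc)
    mask = np.isfinite(tcc)
    z = np.arctanh(np.clip(tcc[mask], -0.999999, 0.999999))   # Fisher Z-transform
    return np.tanh(np.average(z, weights=w[mask]))

# Hypothetical 2.5-degree field of grid-point correlations (72 x 144):
rng = np.random.default_rng(2)
lat = np.linspace(-88.75, 88.75, 72)
tcc = np.clip(rng.normal(0.4, 0.2, size=(72, 144)), -0.95, 0.95)
print(f"<TCC> = {area_avg_tcc(tcc, lat):.3f}")
```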

3.2.2 Global averages

To summarize, in Fig. 12 we show graphs of the area-averaged \(\text{NRMSE}\), \(\text{MSSS}\) and \(\text{TCC}\) for the monthly and the 3-month average forecasts as a function of the forecast horizon. In all the graphs, the red lines with circles correspond to the average over the grid points for the whole planet, the blue lines with open squares are for grid points over the ocean and the green lines with triangles are for grid points over land. The corresponding dashed lines of the same colours represent the respective scores obtained if only the anthropogenic trend is forecast. In all cases, the reference forecast is the climatological annual cycle. Judging from the average values, we can conclude that the skill over ocean is always greater than over land, with the global skill lying between the two.

Fig. 12

Graphs of the area-averaged \(\text{NRMSE}\), \(\text{MSSS}\) and \(\text{TCC}\) for the monthly (a–c) and the 3-month average (d–f) forecasts as a function of the forecast horizon. In all the graphs, the red lines with circles correspond to the average over the grid points for the whole planet, the blue lines with open squares are for grid points over the ocean and the green lines with triangles are for grid points over land. The corresponding dashed lines of the same colours represent the respective scores obtained if only the anthropogenic trend is forecast

As mentioned previously, for a perfectly scaling process the 3-month average forecasts for k = 1–3 months would have the same skill as the monthly forecast for \(k=1\) month. In the same way, the seasonal forecast for k = 4–6 months would correspond to the monthly forecast for \(k=2\) months, that for k = 7–9 months to \(k=3\) months and that for k = 10–12 months to \(k=4\) months. A comparison between panels (a) and (d), and (b) and (e), in Fig. 12 shows that this is reasonably well satisfied for the \(\text{NRMSE}\) and \(\text{MSSS}\), respectively. Of course, the actual comparison should be made for the forecast of the natural temperature variability, which is the true scaling process. The results shown in Fig. 12 are for the raw anomalies, which include the anthropogenic trend that breaks the scale invariance of the fluctuations.

The difference between the curves and the dashed horizontal lines (showing the skill if only the anthropogenic trend is forecast) corresponds to the skill in forecasting the natural variability using the fGn model. While the forecast of the natural variability is reasonably skillful for \(k\le 6\) months, for longer horizons most of the overall skill comes from projecting the anthropogenic trend. Finally, note that the global and ocean averages vary monotonically with \(k\), but the land averages show some oscillation that indicates a seasonality effect in the forecasts. This seasonality is analyzed in the next section.

3.2.3 Multiplicative seasonality

The results shown in panel (d) of Figs. 9, 10 and 11 were obtained for the seasonal forecast without distinguishing specific seasons. In fact, StocSIPS assumes that each month has the same anomaly statistics. It is precisely this month-to-month correlation that is exploited as a source of predictability in the stochastic model. Nevertheless, there is always an intrinsic multiplicative seasonality in the data that is impossible to remove completely without affecting the scaling behaviour. This seasonal interannual variability is shown in Fig. 13, where the standard deviation of the 3-month averaged anomalies is shown for each conventional season: (a) December to February (DJF), (b) March to May (MAM), (c) June to August (JJA) and (d) September to November (SON). The difference in variability between the spring and the fall seasons (panels (b) and (d)) is small. In comparison, the interannual variability over land in the northern hemisphere is larger during the boreal winter (DJF) and smaller during the summer (JJA).

Fig. 13

Interannual standard deviation (SD) of the temperature anomalies for the conventional seasons: (a) DJF, (b) MAM, (c) JJA and (d) SON. The values in brackets in the figure labels represent the areal mean of global standard deviation and the areal mean excluding the poles (between 60°S and 60°N)

The largest seasonality is observed in the polar regions, where the winter temperatures are much more intermittent than the summer values. Conversely, during the summer the standard deviation of the anomalies at the poles is much lower than in the other seasons. The values in brackets in the figure labels represent the areal mean of the global standard deviation, \(\langle \text{SD}\rangle\), and the areal mean excluding the poles (between 60°S and 60°N), \({\langle \text{SD}\rangle }_{-60}^{+60}\). Notice that the polar regions contribute substantially to the interannual variability and also that the boreal winter season is in general significantly more variable than the other seasons (\(\langle \text{SD}\rangle\) goes from 1.16 °C for DJF to roughly 1.03 °C for the others). A possible explanation for this seasonality is that, when removing the annual cycle and the trend associated with the anthropogenic warming, we assumed that both were statistically independent. This is not completely true for the polar regions. While for the rest of the planet the anthropogenic temperature response increases uniformly for every month following the increasing CO2 concentrations, at the poles during the summer the temperature is tied to the freezing point of water. This is a shortcoming of the model that could be addressed in future versions of StocSIPS.

3.2.4 Preliminary comparison with GCMs’ seasonal predictions

In the previous sections, we validated StocSIPS as a good model for describing the monthly surface temperature field and we assessed its skill by computing monthly and 3-month average scores, without distinguishing specific seasons. To account for the effects of the multiplicative seasonality on the predictions, we can stratify the observations and the forecast series to show dependencies on the target season and the forecast horizon. This is usually the kind of forecast published by the major operational centers for seasonal prediction. In this section we show the skill scores obtained for stratified data and we make a preliminary comparison with other models’ skill to assess the relative advantages and shortcomings of StocSIPS.

Our purpose in this paper is not to make an exhaustive and detailed comparison with other long-term prediction models. Such a detailed comparison, together with the combination of StocSIPS with conventional numerical models to produce merged forecasts, is the subject of a future paper currently in preparation. Those results are too extensive to include in the present paper, so we limit ourselves to comparing with already published skill scores from other models. For this purpose, we selected the multi-model ensemble (MME) predictions recently published by Kim et al. (2020). An important advantage is that Kim et al. offer a detailed description of the scores and the methods used, which we follow closely here to guarantee reproducibility. The definitions of these metrics are given in Appendix 2, following the guidelines of the WMO Standardized verification system for long-range forecasts (SVS-LRF) (WMO 2010a, b).

In Kim et al. (2020), the authors assess different MME combination methods for seasonal prediction using hindcast datasets of six models from five Global Producing Centers (GPCs) for long-range forecasts (LRFs) designated by the WMO (Graham et al. 2011). The six models included in their analysis cover a common 27-year hindcast period from 1983 to 2009. The selected GPCs were: Melbourne, Montreal (two models), Moscow, Seoul and Tokyo. References and details of the individual models can be found in Kim et al. (2020). The authors study seven experimental deterministic MME methods to merge the six seasonal forecast systems: simple composite method (SCM), simple linear regression (SLR), multiple linear regression (MLR), best selection anomaly (BSA), multilayer perceptron (MLP), radial basis function (RBF) and genetic algorithm (GA). Their reported scores for 2-m temperature were obtained for 1-month lead retrospective forecasts on a grid with a resolution of 2.5° in both longitude and latitude. To produce the figures in this section, we used and adapted some of the figures from Kim et al. (2020) (including supporting information).

3.2.4.1 Mean Square Skill Score (MSSS)

For a better comparison with the results of Kim et al., all the seasonal scores for StocSIPS were obtained for observational and forecast seasonal anomalies calculated as departures from the climatology in a leave-one-out cross-validation scheme for the period 1983–2009. In Fig. 14, we show maps of the \(\text{MSSS}\) of StocSIPS for: (a) DJF, (b) MAM, (c) JJA and (d) SON. In all cases, the forecasts used data up to the beginning of each respective season (average for k = 1–3 months), i.e. including data up to November for DJF, up to February for MAM, and so on. The values in brackets in the figure labels represent the globally averaged score, \(\langle \text{MSSS}\rangle\) (see Eq. 40). In panels (e) and (f) we reproduce maps of \(\text{MSSS}\) for DJF and JJA, respectively, from Figs. S1 and S2 of Kim et al. (2020) (supporting information) for their best MME combination method (GA).

Fig. 14

\(\text{MSSS}\) for: (a) DJF, (b) MAM, (c) JJA and (d) SON. In all cases, the forecasts used data up to the beginning of each respective season (average for k = 1–3 months). The values in brackets in the figure labels represent the globally averaged \(\text{MSSS}\) (see Eq. (40)). The maps shown in e and f for the GCMs’ MME predictions of DJF and JJA, respectively, were reproduced from Figs. S1 and S2 of Kim et al. (2020) (supporting information) for their best MME combination method (GA). Kim et al. (2020) is an open access article distributed under the terms of the Creative Commons CC BY license, which permits unrestricted use, distribution, and reproduction in any medium

In agreement with the previous results shown in Fig. 10 for the independent months and the all-seasons \(\text{MSSS}\), the predictions for the individual seasons generally show better skill over the ocean than over land. The \(\text{MSSS}\) values are particularly high in the tropical region, with the highest values obtained in the equatorial Pacific for DJF. Similar results were obtained for the GCM forecasts shown in Fig. 14e, f. The globally averaged scores (shown in the top right corner of each plot) are identical for StocSIPS and the MME results: 0.17 and 0.22 for DJF and JJA, respectively. The negative values of \(\text{MSSS}\) for StocSIPS near the north pole for JJA are associated with the multiplicative seasonality effect described in Sect. 3.2.3.

3.2.4.2 Temporal correlation coefficient (TCC)

Similarly to Fig. 14 for the \(\text{MSSS}\), in Fig. 15 we show maps of the \(\text{TCC}\) of StocSIPS for: (a) DJF, (b) MAM, (c) JJA and (d) SON, with the best GCMs’ MME combination (GA) from Kim et al. (2020) in panels (e) and (f) for DJF and JJA, respectively. The values in brackets in the figure labels represent the globally averaged score, \(\langle \text{TCC}\rangle\), computed using Eq. (42). As before, the highest correlation values are achieved in tropical regions. Considering the average scores, there is no significant reduction in the \(\text{TCC}\) for DJF compared to JJA, nor are there particularly low values near the north pole for JJA. Compared to the \(\text{MSSS}\), the multiplicative seasonality effects are less evident in the \(\text{TCC}\), since the latter is a measure of the skill in predicting the phase (sign) and is therefore less dependent on the variability of the anomaly magnitudes.

Fig. 15

\(\text{TCC}\) for: (a) DJF, (b) MAM, (c) JJA and (d) SON. In all cases, the forecasts used data up to the beginning of each respective season (average for k = 1–3 months). The shaded areas indicate the regions above the 5% significance level using a two-tailed Student’s t test. The values in brackets in the figure labels represent the globally averaged score, \(\langle \text{TCC}\rangle\), computed using Eq. (42). The maps shown in e and f for the GCMs’ MME prediction of DJF and JJA, respectively, were reproduced from Figs. S5 and S6 of Kim et al. (2020) (supporting information) for their best MME combination method (GA)

3.2.4.3 Anomaly pattern correlation coefficient (ACC)

The temporal evolution of the forecast skill can be assessed using the anomaly pattern correlation coefficient (\(\text{ACC}\)), which is the spatial correlation between the observational and forecast anomalies for any given date (see Eq. 37). It shows how well the model reproduces the distribution of temperature anomalies around the globe for a given season. Figure 16 shows the evolution of the \(\text{ACC}\) for StocSIPS (black line with solid circles) and for each of the seven MME combination methods studied by Kim et al. (colored lines with markers) over the 27-year verification period 1983–2009. Graphs for each season are shown independently: (a) MAM, (b) JJA, (c) SON and (d) DJF. This figure was adapted from Fig. 3 in Kim et al. (2020) to include the StocSIPS scores. In the original figure, the authors also show the absolute value of the El Niño 3.4 index (black line without markers) to study the dependence of the ensemble predictions on the El Niño phase. Their main conclusion is that the GCM MMEs perform better during ENSO events than during non-ENSO events for all seasons. No similar behaviour was found for StocSIPS, whose performance based on the \(\text{ACC}\) varies independently of the ENSO phase. The average scores for the POV (see Eq. 43) are shown in the right panels for each of the respective seasons. Comparing these values, we can see that StocSIPS has better overall skill than most of the GCM MME combinations for all seasons. Only for JJA is the StocSIPS score lower than those of the best three MMEs (SCM, SLR and GA). For the rest of the seasons, its average score is almost equal to (or slightly larger than) that of the best MME (using GA or SCM).
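For completeness, here is a minimal sketch of an area-weighted pattern correlation of this type (Eq. (37) may include details not reproduced here; the anomaly maps, grid and forecast model below are hypothetical):

```python
import numpy as np

def pattern_acc(fcst, obs, lat_deg):
    """Area-weighted spatial correlation between forecast and observed anomaly
    maps for a single date."""
    w = np.cos(np.deg2rad(lat_deg))[:, None] * np.ones_like(obs)
    f = fcst - np.average(fcst, weights=w)
    o = obs - np.average(obs, weights=w)
    cov = np.average(f * o, weights=w)
    return cov / np.sqrt(np.average(f ** 2, weights=w) * np.average(o ** 2, weights=w))

# Hypothetical anomaly maps on a 2.5-degree grid for one season:
rng = np.random.default_rng(3)
lat = np.linspace(-88.75, 88.75, 72)
obs = rng.normal(size=(72, 144))
fcst = 0.5 * obs + rng.normal(scale=0.9, size=obs.shape)
print(f"ACC = {pattern_acc(fcst, obs, lat):.3f}")
```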

Fig. 16

\(\text{ACC}\) for StocSIPS (black line with solid circles) and for each of the seven MME combination methods studied by Kim et al. (colored lines with markers) in the 27-year verification period 1983–2009 for: (a) MAM, (b) JJA, (c) SON and (d) DJF. The average scores for the POV (see Eq. (43)) are shown in the right panels for each of the respective seasons. The absolute value of the El Niño 3.4 index (black line without markers) is also shown. This figure was adapted from Fig. 3 in Kim et al. (2020) to include the StocSIPS scores

3.2.4.4 Globally averaged TCC and RMSE

Comparisons of the globally averaged \(\text{TCC}\) and \(\text{RMSE}\) (see Eqs. (39) and (42)) are shown in Fig. 17a, b, respectively, for each season. The bars are for the MME combination methods in Kim et al. (2020), together with the mean of single model skills (MSMS). The scores for StocSIPS were included as horizontal lines with the same color code as the bars for each respective season. The dashed black line indicates the threshold at which the estimated \(\text{TCC}\) is statistically significant at the 5% level using a one-tailed Student's t test. The GA method shows the best performance, although it is very close to the SCM with equal weights for each model. Most MME predictions show higher skill than the corresponding MSMS for all four seasons, although in some cases (such as the \(\text{TCC}\) for MLP) the MME combination does not improve on the single-model predictions. In all cases, the \(\text{TCC}\) of the StocSIPS forecasts is larger than that of the best GCM MME. Similarly, the StocSIPS \(\text{RMSE}\) is lower than that of most of the MME combinations for all seasons. Only the SCM, GA and SLR show lower errors than StocSIPS for the JJA predictions. The globally averaged \(\text{TCC}\) does not show a large seasonal variation, but there is still a reduction in skill for JJA and DJF associated with the high variability at the poles discussed in Sect. 3.2.3. This multiplicative seasonality effect is clear in the average \(\text{RMSE}\), which follows the average \(\text{SD}\) values in the caption of Fig. 13.

Fig. 17

Globally averaged \(\text{TCC}\) (a) and \(\text{RMSE}\) (b) (Eqs. (42) and (39), respectively) for MAM (red), JJA (blue), SON (green) and DJF (purple) for the period 1983–2009. The bars are for the MME combination methods in Kim et al. (2020), together with the mean of single model skills (MSMS). The scores for StocSIPS were included as horizontal lines with the same color code for each respective season. The dashed black line indicates that the estimated \(\text{TCC}\) is statistically significant at the 5% level using the one-tailed Student's t test. This figure was adapted from Figs. 5 and 6 in Kim et al. (2020) to include the StocSIPS scores

For an overall comparison, in Fig. 18 we show a plot of the 4-season-averaged \(\text{RMSE}\) vs. \(\text{TCC}\) for the six individual models used in Kim et al. (2020) (red crosses) and the seven MME combinations (letters). The StocSIPS scores were included as a blue asterisk. For the GCMs, the GA method has the best performance—very close to the SCM—with the highest \(\text{TCC}\) (0.51) and the lowest \(\text{RMSE}\) (0.64). The StocSIPS forecasts have similar \(\text{RMSE}\) (0.64), but better average \(\text{TCC}\) (0.55).

Fig. 18

4-season-averaged \(\text{RMSE}\) vs. \(\text{TCC}\) for the six individual models used in Kim et al. (2020) (red crosses), the seven MME combinations (letters) and StocSIPS (blue asterisk). This figure was adapted from Fig. 7 in Kim et al. (2020) to include the StocSIPS scores

4 Summary and discussion

In this paper we applied the Stochastic Seasonal to Interannual Prediction System (StocSIPS) to the monthly and seasonal prediction of the surface temperature with a 2.5° × 2.5° spatial resolution. The theory and the basis of the numerical methods used in StocSIPS were previously presented and applied to the forecast of globally averaged temperature in Del Rio Amador and Lovejoy (2019). StocSIPS is based on two statistical properties of the macroweather regime: the near Gaussianity of temperature fluctuations and the temporal scaling symmetry of the natural variability. The model is a high-frequency approximation to the Fractional Energy Balance Equation (FEBE), a fractional generalization of the usual EBE.

StocSIPS models the temperature series at each grid point independently as a superposition of a periodic signal corresponding to the annual cycle, a low-frequency deterministic trend from anthropogenic forcings and a high-frequency stochastic component representing the natural variability. The annual cycle can be estimated directly from the data and is assumed constant in the future, at least for horizons of a few years. The anthropogenic component is represented as a linear response to the equivalent CO2 forcing and can be projected very accurately 1 year into the future using only one parameter: the climate sensitivity, itself obtained from a linear regression using historical emissions. Finally, the natural variability is modeled as a discrete-in-time fGn process, which is completely determined by its variance and fluctuation exponent. This gives a total of only three parameters for each grid point for modeling and predicting the surface temperature. These parameters are quite robust and can be estimated with good accuracy from past data. The same procedure could be extended to any other field, provided it satisfies the Gaussianity and the scaling behaviour of the fluctuations.
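To make the role of the third component concrete, the sketch below shows how a conditional-expectation forecast of the natural variability can be built from the fGn autocovariance alone. It is a minimal illustration under our own simplifying choices (memory window, toy data, function names); it is not the operational StocSIPS code.

```python
import numpy as np
from scipy.linalg import solve

def fgn_autocov(lags, H, sigma2=1.0):
    """Autocovariance of a discrete-in-time fGn with fluctuation exponent H
    (-1/2 < H < 0); H + 1 is the Hurst exponent of the corresponding fBm."""
    hb = H + 1.0
    k = np.abs(np.asarray(lags, dtype=float))
    return 0.5 * sigma2 * ((k + 1) ** (2 * hb) - 2 * k ** (2 * hb) + np.abs(k - 1) ** (2 * hb))

def fgn_predict(past, H, k, memory=120):
    """Conditional expectation (optimal linear predictor) of the fGn k steps
    ahead, given its last `memory` values."""
    x = np.asarray(past[-memory:], dtype=float)
    m = x.size
    lags = np.arange(m)
    C = fgn_autocov(np.abs(lags[:, None] - lags[None, :]), H)   # covariance of the past
    c = fgn_autocov((m - 1 - lags) + k, H)                      # cov(past, future value)
    w = solve(C, c, assume_a='pos')                             # predictor weights
    return float(w @ x)

# Toy usage with a hypothetical detrended, deseasonalized anomaly series `t_nat`:
rng = np.random.default_rng(5)
t_nat = rng.normal(size=400)     # stand-in values; a real fGn series would be used in practice
print(f"1-month-ahead natural-variability forecast: {fgn_predict(t_nat, H=-0.2, k=1):+.3f}")
```

Note that the predictor weights depend only on \(H\), the horizon and the memory used, while the variance only sets the width of the forecast distribution.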

Although we mentioned that fGn with a fluctuation exponent in the range \(-1/2<H<0\) is a good model for the natural variability, a distinction must be made for most of the tropical ocean region, for which a positive fluctuation exponent was found. Instead of using the fGn model there, we must use the general fRn model or its high-frequency fBm approximation with \(0<H<1\). It is significant that, within this tropical ocean region, it is only in the more predictable part associated with the ENSO phenomenon that we obtain fluctuation exponents in the range \(1/2<H<1\), whose fBm approximation has persistent (positively correlated) increments.

It is surprising that, by using only three parameters for each location (the fluctuation exponent, \(H\), the standard deviation, \({\sigma }_{T}\), and the transient climate sensitivity, \({\lambda }_{2\times {\text{CO}}_{2}\text{eq}}\)), we can build a model that accurately describes the temperature field. The adequacy of the model was verified by testing the whiteness of the residual innovations and by validating the theoretically expected scores (if the model were perfect) against the actual hindcast results. This also implies that, for probabilistic forecasts, StocSIPS is a nearly perfectly reliable system without the need for recalibration of the forecast probability distribution.
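As an indication of how such a whiteness test can be carried out, here is a simple sample-autocorrelation check (our own minimal version, not necessarily the exact test used for StocSIPS; the innovation series is hypothetical):

```python
import numpy as np

def whiteness_check(innov, max_lag=24):
    """Compare sample autocorrelations of the residual innovations with the
    approximate 95% band +/- 1.96/sqrt(N) expected for white noise."""
    x = innov - innov.mean()
    n = x.size
    acf = np.array([np.dot(x[:-k], x[k:]) / np.dot(x, x) for k in range(1, max_lag + 1)])
    band = 1.96 / np.sqrt(n)
    return acf, band, np.mean(np.abs(acf) > band)

# Hypothetical innovation series for one grid point (white by construction here):
rng = np.random.default_rng(4)
acf, band, frac = whiteness_check(rng.normal(size=840))   # 70 years of monthly values
print(f"95% band = +/-{band:.3f}; fraction of lags outside band = {frac:.2f}")   # ~0.05
```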

The hindcast verification results show that the skill is generally greater over the ocean than over land, in particular over the more persistent tropical ocean region. One of the implications of the scaling that was verified is that the 3-month average forecast has the same skill as the 1-month-ahead monthly prediction. This is possible because, although the horizon is further in the future, the seasonal forecast is for a longer (3-month) average. For scaling processes, the two effects exactly compensate.

The seasonal predictions show decreased skill in the polar regions during the summer. A possible explanation for this seasonality is that, when removing the annual cycle and the trend associated with the anthropogenic warming, we assumed that both were statistically independent. This is not true for the polar regions. While for the rest of the planet the anthropogenic temperature response increases uniformly for every month following the increasing CO2 concentrations, at the poles during the summer the temperature is tied to the freezing point of water. This spurious seasonality, introduced in the preprocessing of the data, can be corrected in future versions of StocSIPS to improve the global forecasts.

Besides this seasonality near the poles, the globally averaged skill score values are also lower during the boreal winter. This can be explained by the asymmetric distribution of land mass between the northern and the southern hemispheres and the fact that the atmospheric temperature near the surface is more stable over ocean than over land. Further improvements in the model may be possible using recalibration of the individual forecasts for every season.

Although the purpose of this paper is not to make a detailed and exhaustive comparison with other long-term prediction models, it is important to at least present a preliminary comparison with already published skill scores from other models in order to assess the advantages and shortcomings of StocSIPS. The evaluation against seven different MME combination methods using six models from the Global Producing Centers for long-range forecasts, published by Kim et al. (2020), showed that the skill scores obtained with StocSIPS are comparable to (or, in the case of the temporal correlation coefficient, better than) those of the best MME combination (which has larger skill than any individual ensemble member). This is in agreement with the previous results in Del Rio Amador and Lovejoy (2019), which show that StocSIPS outperformed the Canadian MME (CanSIPS) for all but the first month of the forecast. This preliminary comparison for seasonal forecasts validates StocSIPS as a good alternative and a complementary option to conventional numerical models.

StocSIPS and GCMs are based on entirely different approaches. While the GCMs take only the initial state of the system (with perturbations to produce multiple ensemble realizations), they exploit all possible interactions with other atmospheric variables and other locations to produce their forecasts through the integration of the dynamical equations. Conversely, StocSIPS neglects all the spatial (and other-variable) relations and produces forecasts based on the past states at a single location by exploiting the long memory of the system. Another way to view this is that, for forecasts, GCMs are initial value models that generate many “stochastic” realizations of the state of the atmosphere, whereas StocSIPS is effectively a “past value problem” that directly estimates the most probable future state (conditional expectation).

Although there is no evident mechanism that explains how the distant past affects the current state of the system, model reduction, as explained by the Mori-Zwanzig formalism (Mori 1965; Zwanzig 1973, 2001; Gottwald et al. 2017), shows that memory effects arise if we only look at one part of the system (e.g. the temperature at a given location). All the interactions coming from the other degrees of freedom are embedded in the past values. Recent works (Lovejoy et al. 2015; Lovejoy 2019; Lovejoy et al. 2021) hypothesize that, in the case of temperature, the scaling behaviour is the result of a hierarchy of energy storage mechanisms acting at different temporal and spatial scales. In addition, it was shown that a scaling Mori-Zwanzig effect arises naturally even from the classical heat equation (Lovejoy 2021a, b).

An evident question that arises from our treatment is why we do not exploit the teleconnections in the temperature field to improve the forecast, instead of predicting each series independently. To answer this, in a recent publication (Del Rio Amador and Lovejoy 2021), StocSIPS was extended to the multivariate case (m-StocSIPS) to include, and realistically reproduce, the full space–time cross-correlation structure. Using Granger causality, it was shown that, although large spatial correlations exist in the temperature field, the optimal predictor of the temperature at a given location is obtained from its own past, provided long enough time series are available. These cross-correlations “were already used” to build that past. This means that the predictions given here (in the univariate StocSIPS version) are nearly optimal in this stochastic framework: improvements in the MSSS values of only 1–2% are possible, which is at the noise level of our current predictions. Nevertheless, the fact that the GCMs remain “deterministic” for up to approximately 1–2 years over the oceans (mostly in the tropics) and at the poles, where having a dynamic sea ice model is apparently crucial for subseasonal to seasonal forecasts (Zampieri et al. 2018), suggests that StocSIPS can be combined with GCM outputs to produce a single hybrid forecasting system that improves on both.