1 Introduction

Spatio-temporal data are becoming increasingly prevalent in areas ranging from climatology, ecology, epidemiology and cellular biology to the social sciences. As such, developing models for spatio-temporal data that are able to describe the structures present in the data and produce accurate predictions is an important area of statistical research. In this paper, we apply a framework for improving predictions from spatio-temporal models to the modelling of rainfall data in the Murray-Darling Basin (MDB) in Australia. The framework itself, however, can be applied generally to models for spatio-temporal data.

Before detailing the framework for improving predictions, we first provide a brief outline of the general approach used for modelling spatio-temporal data. Let \(Y(\varvec{s}_i,t)\) denote the spatio-temporal data value at spatial location \(\varvec{s}_i\) and time point t, where \(i=1,\ldots ,N\) and \(t=1,\ldots ,T\). In matrix notation, we can express the data as a \(T\times N\) matrix \(\varvec{Y}\), where the rows correspond to time points and the columns correspond to spatial locations. As a general model for \(Y(\varvec{s}_i,t)\), following Banerjee et al. (2015), we write

$$\begin{aligned} Y(\varvec{s}_i,t)=\mu (\varvec{s}_i,t)+e(\varvec{s}_i,t), \end{aligned}$$
(1)

where \(\mu (\varvec{s}_i,t)=E(Y(\varvec{s}_i,t))\) denotes the mean spatio-temporal structure and \(e(\varvec{s}_i,t)\) denotes the zero-mean residual component. Additional models and distributional assumptions can be applied to both the mean structure \(\mu (\varvec{s}_i,t)\) and the residual component \(e(\varvec{s}_i,t)\).

Consider now the problem of predicting a new data value \(Y^*=Y(\varvec{s}^*,t^*)\), at some spatial location \(\varvec{s}^*\) and time point \(t^*\), given some training data \(\varvec{Y}^{\text {tr}}\). A common approach is to use the estimated expected value \({{\hat{\mu }}}(\varvec{s}^*,t^*)={{\hat{E}}}(Y^*)\) as the prediction for \(Y^*\). Depending on how \(\mu (\varvec{s}_i,t)\) in (1) was further modelled, computing the expected value \({{\hat{\mu }}}(\varvec{s}^*,t^*)\) may require estimating various model parameters from the training data. A better approach to prediction is based on minimising the mean squared prediction error. If \(g(\varvec{Y}^{\text {tr}})\) denotes a predictor of \(Y^*\), the mean squared prediction error (MSPE) is given by

$$\begin{aligned} E\left[ \left( Y^*-g\left( \varvec{Y}^{\text {tr}}\right) \right) ^2\right] . \end{aligned}$$

The predictor that minimises the MSPE is \(E\left( Y^*|\varvec{Y}^{\text {tr}}\right) \), the conditional expectation of \(Y^*\) given the training data \(\varvec{Y}^{\text {tr}}\). Under some distributional assumptions on the data, we can derive an expression for \(E\left( Y^*|\varvec{Y}^{\text {tr}}\right) \). If \(Y^*\) and \(\varvec{y}^{\text {tr}}\) are jointly normally distributed, where \(\varvec{y}^{\text {tr}}\) denotes the vectorised form of the training data, it can be shown that

$$\begin{aligned} E\left( Y^*|\varvec{y}^{\text {tr}}\right) =E(Y^*) + \text {Cov}(Y^*,\varvec{y}^{\text {tr}})\text {Var}(\varvec{y}^{\text {tr}})^{-1} \left( \varvec{y}^{\text {tr}}-E(\varvec{y}^{\text {tr}})\right) . \end{aligned}$$
(2)

This is of course the familiar expression for Best Linear Unbiased Predictors (BLUPs; Henderson 1950; Robinson 1991) that also underlies the method of kriging for spatial interpolation (see Krige 1951, 1962; Cressie 1990). Its advantage over the estimated expected value (the first term in (2)) is that it brings in and optimally weights information from observations in the training data that are correlated with the new data value. Note that the normality assumption is not a strict requirement for using (2) to produce predictions, as (2) can also be derived without assuming normality (Harville 1976). However, in this case the sense in which the predictions are “best” is weaker because it is constrained by the linearity and unbiasedness requirements. Nonetheless, provided the data are approximately normal (or at least not strongly non-normal), predictions from (2) are likely to be an improvement over using only \(E(Y^*)\). Further, these improved predictions can be produced for many models.

To implement (2), we need to compute estimates of the \(E(Y^*)\), \(\text {Cov}(Y^*,\varvec{y}^{\text {tr}})\), \(\text {Var}(\varvec{y}^{\text {tr}})\) and \(E(\varvec{y}^{\text {tr}})\) terms, which will require estimating any necessary model parameters from the training data. Note that this framework for improved predictions will apply to any model where we can compute these terms. Ideally, we would like to use all of the training data to compute these terms because the optimality property requires doing so. However, depending on the size of the training data, computing the inverse of the \(NT\times NT\) matrix \(\text {Var}(\varvec{y}^{\text {tr}})\) can quickly become computationally prohibitive. We will consider using a reduced subset of the training data for computing estimates of \(\text {Cov}(Y^*,\varvec{y}^{\text {tr}})\), \(\text {Var}(\varvec{y}^{\text {tr}})\) and \(E(\varvec{y}^{\text {tr}})\), as this will decrease the number of calculations required. For example, we could use only recent time points or only neighbouring spatial locations in the training data. Depending on the type of prediction that is desired (e.g., predicting only at an unobserved time point, predicting only at an unobserved spatial location, or predicting at both an unobserved time point and spatial location), a particular method of subsetting the training data may be more appropriate than others. We explore various subsetting methods and compare their resulting predictive performance.
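To make the computation in (2) concrete, the following sketch (in Python with numpy; all names are illustrative and the inputs are assumed to be plug-in estimates obtained from the fitted model) computes the predictor for a single new value, with an optional index subset corresponding to the reduced training subsets discussed above. The variance matrix is handled by a linear solve rather than an explicit inverse, which is generally preferable numerically.

```python
import numpy as np

def blup_predict(mu_star, mu_tr, cov_star_tr, var_tr, y_tr, subset=None):
    """Plug-in version of the conditional-expectation predictor in (2).

    mu_star     : estimated E(Y*)                    (scalar)
    mu_tr       : estimated E(y^tr)                  (n,)
    cov_star_tr : estimated Cov(Y*, y^tr)            (n,)
    var_tr      : estimated Var(y^tr)                (n, n)
    y_tr        : vectorised training data           (n,)
    subset      : optional index array selecting a reduced subset
                  of the training data (hypothetical usage)
    """
    if subset is not None:
        mu_tr = mu_tr[subset]
        cov_star_tr = cov_star_tr[subset]
        var_tr = var_tr[np.ix_(subset, subset)]
        y_tr = y_tr[subset]
    # Solve Var(y^tr) w = y^tr - E(y^tr) instead of forming the inverse.
    w = np.linalg.solve(var_tr, y_tr - mu_tr)
    return mu_star + cov_star_tr @ w
```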

The motivation for improving prediction for spatio-temporal data stemmed from our work in Nowak et al. (2018). There we developed a hierarchical model for spatio-temporal rainfall data of the type suggested by Banerjee et al. (2015) and produced predictions in time and predictions in space. While the model performed reasonably well when predicting in time, it was outperformed by a naive single nearest neighbour-based prediction when predicting in space. This indicated that we were likely not sufficiently using the structure/information present in the data and that there was substantial scope to improve the predictions.

We apply the framework for improved prediction to a model for spatio-temporal rainfall data in the MDB (described in Nowak et al. 2018). The predictions produced from applying the framework greatly improve the overall accuracy compared to those produced from the model using only \(E(Y^*)\). In the next section, we provide some background on the MDB rainfall data and the model used for modelling these data. In Sects. 3.1, 3.2 and 3.3 we then detail the application of the framework for improved prediction to this model, for predicting in time, predicting in space and predicting in both space and time, respectively.

2 Model for monthly rainfall data in the Murray-Darling Basin

The data analysed consisted of monthly rainfall measurements recorded across a network of weather stations in the MDB for which high-quality data were available. Specifically, \(Y(\varvec{s}_i,t)\) denotes the cube-root transformed monthly rainfall measurement, recorded across \(N=78\) weather stations over a period of \(T=986\) months (spanning January 1923 to February 2005). In addition to the rainfall measurements, a number of spatial variables were available for each station: the Cartesian x and y coordinates, the elevation and the climatic region (top, middle, bottom) to which the station belonged. Further details can be found in Nowak et al. (2018).

The hierarchical model we used for the MDB data proposes models for the mean and residual components in (1). The mean is modelled as

$$\begin{aligned} \mu (\varvec{s}_i,t)=\sum _{j=0}^J\beta _j(\varvec{s}_i)f_j(t), \end{aligned}$$
(3)

where the \(\left\{ f_j(t)\right\} _{j=0}^J\) are smooth basis functions that describe temporal patterns that may be present in the data and \(\left\{ \beta _j(\varvec{s}_i)\right\} _{j=0}^J\) are spatially-varying coefficients that allow the temporal patterns to differ across locations. We set \(J=3\). The first basis function is the constant \(f_0(t)=1\), and \(f_1(t)\) and \(f_2(t)\) represent deterministic seasonal effects. The final basis function \(f_3(t)\) is derived empirically: after removing the effects of the deterministic basis functions from the data, a singular value decomposition (SVD) is applied to the residuals and \(f_3(t)\) is set to be the first left singular vector of the SVD.
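As an illustration only (not necessarily the exact computation used in Nowak et al. 2018), the empirical basis function can be extracted along the following lines, assuming a \(T\times N\) residual matrix is available after removing the deterministic components; the residuals below are random placeholders.

```python
import numpy as np

# R: T x N matrix of residuals after removing the effects of the
# deterministic basis functions f_0, f_1 and f_2 (placeholder values here).
T, N = 986, 78
rng = np.random.default_rng(0)
R = rng.standard_normal((T, N))

# First left singular vector of the SVD gives the empirical basis f_3.
U, singular_values, Vt = np.linalg.svd(R, full_matrices=False)
f3 = U[:, 0]          # length-T vector, one value per month
```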

The spatially varying coefficients \(\left\{ \beta _j(\varvec{s}_i)\right\} _{j=0}^J\) are further modelled by assuming the \(\varvec{\beta }_j=\left( \beta _j(\varvec{s}_1),\ldots ,\beta _j(\varvec{s}_N)\right) ^T\) satisfy

$$\begin{aligned} \varvec{\beta }_j \overset{\text {ind}}{\sim }N\left( \varvec{X}_j\varvec{\alpha }_j, \varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}}\right) ,\quad j=0,\ldots ,J, \end{aligned}$$
(4)

where the columns of \(\varvec{X}_j\) represent a set of known spatial variables with \(\varvec{\alpha }_j\) denoting the unknown coefficients. The \(N\times N\) covariance matrix \(\varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}}\), where \(\varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}}[i_1,i_2]=\text {Cov}(\beta _j(\varvec{s}_{i_1}),\beta _j(\varvec{s}_{i_2}))\), is parameterised by an unknown vector \(\varvec{\theta }_{\varvec{\beta }_j}\). Based on empirical variograms of the residuals from regressing the estimated \(\varvec{\beta }_j\) on the spatial variables \(\varvec{X}_j\), an exponential covariance function was used for modelling the spatial covariances \(\varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}}\).
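For reference, a generic exponential covariance function of this kind can be constructed from pairwise station distances as in the sketch below. The parameterisation (partial sill \(\sigma^2\), range \(\phi\), optional nugget \(\tau^2\)) is an assumption for illustration and not necessarily the exact form fitted in Nowak et al. (2018).

```python
import numpy as np

def exponential_cov(coords, sigma2, phi, tau2=0.0):
    """N x N exponential covariance matrix from station coordinates.

    coords : (N, 2) array of Cartesian x, y coordinates
    sigma2 : partial sill, phi : range parameter, tau2 : optional nugget
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))          # pairwise distances
    return sigma2 * np.exp(-dist / phi) + tau2 * np.eye(len(coords))
```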

We assume the temporal patterns are captured in the mean structure through the basis functions, so the residuals \(e(\varvec{s}_i,t)\) are assumed to be independent over time and independent of the \( \beta _j(\varvec{s}_i)\). Letting \(\varvec{e}_t=\left( e(\varvec{s}_1,t),\ldots ,e(\varvec{s}_N,t)\right) ^T\), we assume

$$\begin{aligned} \varvec{e}_t \overset{\text {ind}}{\sim }N\left( 0,\varvec{\Sigma }_{\varvec{\theta }_{\varvec{e}}}\right) ,\quad t=1,\ldots ,T, \end{aligned}$$
(5)

where the \(N\times N\) covariance matrix \(\varvec{\Sigma }_{\varvec{\theta }_{\varvec{e}}}\), with \(\varvec{\Sigma }_{\varvec{\theta }_{\varvec{e}}}[i_1,i_2]=\text {Cov}(e(\varvec{s}_{i_1},t),e(\varvec{s}_{i_2},t))\), is parameterised by an unknown vector \(\varvec{\theta }_{\varvec{e}}\). Similar to \(\varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}}\), based on empirical variograms of the residuals, an exponential covariance function was used for modelling \(\varvec{\Sigma }_{\varvec{\theta }_{\varvec{e}}}\).

Given the model and the assumptions, we need to derive various covariance expressions to use in Sect. 3 when we construct improved predictions. This is not straightforward, but the process can be simplified by treating the basis functions \(\left\{ f_j(t)\right\} _{j=0}^J\), once determined, as fixed, because we can then formulate our hierarchical spatio-temporal model as a linear mixed model. This formulation (which is not presented explicitly in this paper) could also be used to fit the model, given the basis functions \(\left\{ f_j(t)\right\} _{j=0}^J\), in a single step using maximum likelihood or restricted maximum likelihood (REML) to improve the efficiency of the parameter estimates, although there are advantages in the sequential stepwise approach used in Nowak et al. (2018). Here we use the linear mixed model formulation only to compute the covariances needed for improved prediction; this is still complicated, but it is much easier than attempting to compute the covariances without this intermediate step.

The covariance between two data values \(Y(\varvec{s}_{i_1},t_1)\) and \(Y(\varvec{s}_{i_2},t_2)\) is given by

$$\begin{aligned} \text {Cov}(Y(\varvec{s}_{i_1},t_1),Y(\varvec{s}_{i_2},t_2))=\sum _{j=0}^Jf_j(t_1)f_j(t_2)\varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}}[i_1,i_2]+I(t_1=t_2)\varvec{\Sigma }_{\varvec{\theta }_{\varvec{e}}}[i_1,i_2]. \end{aligned}$$

Letting the N-vector \(\varvec{y}_t=(Y(\varvec{s}_1,t),\ldots ,Y(\varvec{s}_N,t))^T\) denote the transpose of the tth row of the data matrix \(\varvec{Y}\), the covariance between \(\varvec{y}_{t_1}\) and \(\varvec{y}_{t_2}\) is given by the \(N\times N\) matrix

$$\begin{aligned} \text {Cov}(\varvec{y}_{t_1}, \varvec{y}_{t_2}) =\sum _{j=0}^Jf_j(t_1)f_j(t_2)\varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}} + I(t_1=t_2)\varvec{\Sigma }_{\varvec{\theta }_{\varvec{e}}}, \end{aligned}$$
(6)

where \(\text {Cov}(\varvec{y}_{t_1},\varvec{y}_{t_2})[i_1,i_2]=\text {Cov}(Y(\varvec{s}_{i_1},t_1),Y(\varvec{s}_{i_2},t_2))\). More generally, if \(\varvec{y}_{t_i}^{N_i}\) denotes a generic \(N_i\)-vector of data values over \(N_i\) spatial locations at time \(t_i\), then \(\text {Cov}(\varvec{y}_{t_1}^{N_1}, \varvec{y}_{t_2}^{N_2})\) will be the \(N_1\times N_2\) matrix given by

$$\begin{aligned} \text {Cov}(\varvec{y}_{t_1}^{N_1}, \varvec{y}_{t_2}^{N_2}) =\sum _{j=0}^Jf_j(t_1)f_j(t_2){\varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}}}_{N_1\times N_2} + I(t_1=t_2){\varvec{\Sigma }_{\varvec{\theta }_{\varvec{e}}}}_{N_1\times N_2}, \end{aligned}$$
(7)

where \({\varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}}}_{N_1\times N_2}\) and \({\varvec{\Sigma }_{\varvec{\theta }_{\varvec{e}}}}_{N_1\times N_2}\) are the appropriate \(N_1\times N_2\) submatrices of \(\varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}}\) and \(\varvec{\Sigma }_{\varvec{\theta }_{\varvec{e}}}\), respectively. Letting the T-vector \(\varvec{y}_{\varvec{s}_i}=(Y(\varvec{s}_i,1),\ldots ,Y(\varvec{s}_i,T))^T\) denote the ith column of \(\varvec{Y}\), the covariance between \(\varvec{y}_{\varvec{s}_{i_1}}\) and \(\varvec{y}_{\varvec{s}_{i_2}}\) is given by the \(T\times T\) matrix

$$\begin{aligned} \text {Cov}(\varvec{y}_{\varvec{s}_{i_1}}, \varvec{y}_{\varvec{s}_{i_2}}) = \sum _{j=0}^J\varvec{f}_j\varvec{f}_j^T\varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}}[i_1,i_2] + \varvec{I}_T\varvec{\Sigma }_{\varvec{\theta }_{\varvec{e}}}[i_1,i_2], \end{aligned}$$
(8)

where \(\varvec{f}_j=(f_j(1),\ldots ,f_j(T))^T\), \(\varvec{I}_T\) is the \(T\times T\) identity matrix and \(\text {Cov}(\varvec{y}_{\varvec{s}_{i_1}},\varvec{y}_{\varvec{s}_{i_2}})[t_1,t_2]=\text {Cov}(Y(\varvec{s}_{i_1},t_1),Y(\varvec{s}_{i_2},t_2))\). More generally, if \(\varvec{y}_{\varvec{s}_i}^{T_i}\) denotes a generic \(T_i\)-vector of data values across \(T_i\) time points at spatial location \(\varvec{s}_i\), then \(\text {Cov}(\varvec{y}_{\varvec{s}_{i_1}}^{T_1}, \varvec{y}_{\varvec{s}_{i_2}}^{T_2})\) will be the \(T_1\times T_2\) matrix given by

$$\begin{aligned} \text {Cov}(\varvec{y}_{\varvec{s}_{i_1}}^{T_1}, \varvec{y}_{\varvec{s}_{i_2}}^{T_2}) = \sum _{j=0}^J\varvec{f}_j^{T_1}\left( \varvec{f}_j^{T_2}\right) ^T \varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}}[i_1,i_2] + \varvec{I}_{T_1\times T_2}\varvec{\Sigma }_{\varvec{\theta }_{\varvec{e}}}[i_1,i_2], \end{aligned}$$
(9)

where \(\varvec{f}_j^{T_i}\) is the appropriate \(T_i\)-subvector of \(\varvec{f}_j\) and \(\varvec{I}_{T_1\times T_2}\) is the appropriate \(T_1\times T_2\) submatrix of \(\varvec{I}_T\).
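The covariances in (6)-(9) all follow the same pattern, so in practice a single helper that evaluates (6) for a pair of time points, and is then restricted to the required rows and columns, is sufficient. A minimal sketch, with the basis functions and fitted spatial covariance matrices assumed to be available as numpy arrays:

```python
import numpy as np

def cov_yt(t1, t2, F, Sigma_beta, Sigma_e):
    """N x N covariance between the data vectors at times t1 and t2, as in (6).

    F          : (T, J+1) matrix whose columns hold f_0, ..., f_J
    Sigma_beta : list of J+1 spatial covariance matrices (each N x N)
    Sigma_e    : residual spatial covariance matrix (N x N)
    Time indices t1, t2 are 0-based row indices into F.
    """
    C = sum(F[t1, j] * F[t2, j] * Sigma_beta[j] for j in range(F.shape[1]))
    if t1 == t2:
        C = C + Sigma_e      # the I(t1 = t2) term in (6)
    return C
```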

2.1 Variable selection on the spatial variables

Before exploring the framework for improved prediction in the hierarchical model described above, we investigated whether incorporating variable selection on the spatial variables in (4) could improve the overall model. The spatial variables available for each station were the Cartesian x and y coordinates, elevation and the climatic region to which the station belonged. The hierarchical model of Nowak et al. (2018) used the same set of spatial variables to model each \(\varvec{\beta }_j\). We now consider allowing the set of spatial variables to potentially differ when modelling each \(\varvec{\beta }_j\), using a best subsets approach to select the best model.

For each station, let \(X_1\) and \(X_2\) denote the x and y coordinates, respectively, \(X_3\) denote the elevation, and \(I_T\) and \(I_M\) denote indicator variables for the top and middle regions, respectively. For each \(\varvec{\beta }_j\), we fitted 34 candidate models and selected the model that produced the smallest value of the Bayesian Information Criterion (BIC). These 34 candidate models were chosen subject to some restrictions: the two climatic region indicator variables (\(I_T\) and \(I_M\)) must always appear together in a model, and only second-order interactions of each continuous variable (\(X_1\), \(X_2\) and \(X_3\)) with the indicator variables were considered.
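A hedged sketch of the selection step for a single \(\varvec{\beta }_j\) is given below. Here `candidate_designs` stands for the list of 34 permitted design matrices (its construction from the restrictions above is not reproduced), and the BIC is computed from an ordinary least-squares fit under a Gaussian likelihood.

```python
import numpy as np

def gaussian_bic(y, X):
    """BIC of an ordinary least-squares fit of y on X (Gaussian likelihood)."""
    n, p = X.shape
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    sigma2_hat = resid @ resid / n
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma2_hat) + 1)
    return -2 * loglik + (p + 1) * np.log(n)     # +1 for the error variance

def best_subset(y, candidate_designs):
    """Return the candidate design matrix with the smallest BIC.

    candidate_designs : hypothetical list of design matrices built from the
                        permitted combinations of spatial variables.
    """
    bics = [gaussian_bic(y, X) for X in candidate_designs]
    return candidate_designs[int(np.argmin(bics))]
```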

For each \(\varvec{\beta }_j\), the selected models with the smallest BIC were the following:

$$\begin{aligned} \beta _0&{=} \alpha _{00}+\alpha _{01}X_1 + \alpha _{02}X_2 + \alpha _{03}X_3 + \alpha _{04}I_T + \alpha _{05}I_M + \alpha _{06}X_1I_T + \alpha _{07}X_1I_M + \epsilon \end{aligned}$$
(10)
$$\begin{aligned} \beta _1&{=} \alpha _{10} + \alpha _{11}X_1 + \alpha _{12}X_2 + \alpha _{13}X_3 + \alpha _{14} I_T + \alpha _{15}I_M + \alpha _{16}X_1I_T + \alpha _{17}X_1I_M + \epsilon \end{aligned}$$
(11)
$$\begin{aligned} \beta _2&{=} \alpha _{20} + \alpha _{21}X_1 + \alpha _{22}X_2 + \alpha _{24}I_T + \alpha _{25}I_M + \alpha _{26}X_1I_T + \alpha _{27}X_1I_M + \alpha _{28}X_2I_T\nonumber \\&\qquad + \alpha _{29}X_2I_M + \epsilon \end{aligned}$$
(12)
$$\begin{aligned} \beta _3&{=} \alpha _{30} + \alpha _{31}X_1 + \alpha _{32}X_2 + \alpha _{33}X_3+ \alpha _{34}I_T + \alpha _{35}I_M + \alpha _{36}X_1I_T + \alpha _{37}X_1I_M \nonumber \\&\qquad + \alpha _{38}X_2I_T + \alpha _{39}X_2I_M + \epsilon \end{aligned}$$
(13)

The parameter estimates for each model, i.e., the \(\hat{\varvec{\alpha }}_j\), for \(j=0,\ldots ,J\), are displayed in Table 1. Also displayed in this table are the standard errors of each parameter estimate, which were calculated using the block bootstrap approach described in Nowak et al. (2018). We note that the standard errors displayed in Table 1 are conditional on the particular selected model. That is, for each \(\varvec{\beta }_j\), the same corresponding model (either (10), (11), (12) or (13)) was fitted for each bootstrap sample.

Table 1 Estimates of \(\varvec{\alpha }_j\), for \(j=0,\ldots ,J\). Standard errors are given in brackets

We next investigated whether this variable selection on the spatial variables leads to better predictive performance when using the estimate of the expected value \(E(Y^*)\) as the predictions. We note that predictions in time use only the estimates \(\hat{\varvec{\beta }}_j\), whereas predictions in space use only the estimates \(\hat{\varvec{\alpha }}_j\). Since variable selection on the spatial variables does not affect the estimates \(\hat{\varvec{\beta }}_j\), predictions in time will remain unchanged with or without variable selection. As such, we will only compare predictive performance for predictions in space. Using the estimated models (10) to (13), we calculated predicted values for each station via leave-one-station-out cross-validation. The root-mean-square error (RMSE) for the predictions was 1.0804, compared to 1.0828 without variable selection. We see that variable selection provides a very marginal improvement in predictive performance when predicting in space. As a visual comparison, the predicted and observed values, averaged over time, are displayed in Fig. 1 (no variable selection on the left, variable selection on the right). We see that the more accurate predictions for stations in the south-east region (blue circles with smaller radii) may be driving the lower overall RMSE. The selected models for \(\varvec{\beta }_j\) given in (10) to (13) will be used in Sects. 3.2 and 3.3 for predicting in space and predicting in both space and time, respectively.

Fig. 1

Predicted (blue) and observed (green) values, averaged over time, for each station. The left plot displays the predicted values without variable selection and the right plot displays the predicted values with variable selection on the spatial variables

3 Applying the framework for improved prediction

Suppose we wish to predict a new data value \(Y(\varvec{s}^*,t^*)\). From (2), the improved prediction \({{\hat{Y}}}(\varvec{s}^*,t^*)\) is given by

$$\begin{aligned} {{\hat{Y}}}(\varvec{s}^*,t^*) = {{\hat{E}}}(Y(\varvec{s}^*,t^*)) + \widehat{\text {Cov}}(Y(\varvec{s}^*,t^*), \varvec{y}^{\text {tr}})\widehat{\text {Var}}(\varvec{y}^{\text {tr}})^{-1} \left( \varvec{y}^{\text {tr}}-{{\hat{E}}}(\varvec{y}^{\text {tr}})\right) . \end{aligned}$$
(14)

To apply these improved predictions to the hierarchical model described in Sect. 2, we will need to calculate the appropriate expected value, covariance and variance terms in (14). How these terms are calculated from the hierarchical model will depend on the type of prediction desired. We will focus on three types of prediction: predicting at an unobserved time point, predicting at an unobserved spatial location, and predicting at an unobserved time point at an unobserved spatial location. For each type of prediction, we set aside some of the data to serve as a test data set on which to compare the improved predictions of (14) to predictions using only \({{\hat{E}}}(Y(\varvec{s}^*,t^*))\).

We noted in Sect. 1 that calculating the covariance and variance terms in (14) can be very computationally intensive for large data sets. As such, we explored the use of reduced subsets of the data to compute these terms. Further, we investigated how the prediction accuracy was affected by varying the size of the subsets. For each type of prediction, we considered three different methods of subsetting the data, which are described schematically in Figs. 2, 4 and 6. Note that the reduced subsets were only used for computing the covariance and variance terms. Any model parameters that are required to calculate the expected value, covariance and variance terms in (14), that is, \(\left\{ \varvec{\beta }_j\right\} _{j=0}^J\), \(\left\{ \varvec{\alpha }_j\right\} _{j=0}^J\), \(\left\{ \varvec{\theta }_{\varvec{\beta }_j}\right\} _{j=0}^J\) and \(\varvec{\theta }_{\varvec{e}}\), were estimated using the full training data.

3.1 Prediction in time

Prediction in time refers to the problem of predicting the data value at a future time point for an existing or observed spatial location. That is, we want to predict \(Y^{\text {tst}}=Y(\varvec{s}_i,k)\), for \(i=1,\ldots ,N\), at some future time point k. Therefore, we will set aside the most recent \(T^{\text {tst}}=12\) months of the data as the test data and use the remaining \(T^{\text {tr}}=974\) months as the training data. That is, \(\varvec{Y}^{\text {tst}}\) will be the \(T^{\text {tst}}\times N\) matrix consisting of the last 12 rows of \(\varvec{Y}\) and \(\varvec{Y}^{\text {tr}}\) will be the \(T^{\text {tr}}\times N\) matrix consisting of the first 974 rows of \(\varvec{Y}\).

The improved prediction \({{\hat{Y}}}(\varvec{s}_i,k)\) is calculated using (14), noting that

$$\begin{aligned} {{\hat{E}}}(Y(\varvec{s}_i,k)) = \sum _{j=0}^J{{\hat{\beta }}}_j(\varvec{s}_i)f_j(k), \end{aligned}$$
(15)

where we use the approach described in Nowak et al. (2018) to calculate the value of the basis functions \(f_j(t)\) at the future time point k. Specifically, the deterministic basis functions \(f_1(t)\) and \(f_2(t)\) were extended in the natural way. The empirically derived basis function \(f_3(t)\) was extended by setting the value at each future month to be the value at that month in the previous year. Predictions using (15) resulted in an RMSE of 0.9190 on the test data, which we will use as a baseline for comparison with our improved predictions. To feasibly compute \(\widehat{\text {Cov}}(Y(\varvec{s}_i,k),\varvec{y}^{\text {tr}})\) and \(\widehat{\text {Var}}(\varvec{y}^{\text {tr}})\) in (14), we will investigate using reduced subsets of the training data. Figure 2 presents a graphical representation of the three subsetting methods used.
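A rough sketch of this extension is given below. The annual sine/cosine pair is purely illustrative of deterministic seasonal terms (the exact form of \(f_1\) and \(f_2\) is not restated here), while \(f_3\) is carried forward from the same month of the previous year as described above.

```python
import numpy as np

def extend_bases(f3_tr, k):
    """Basis-function values at a future month k (1-based, k > len(f3_tr)).

    f3_tr : length-T_tr empirical basis function from the SVD.
    The seasonal terms below are an illustrative annual sine/cosine pair;
    f_3 is set to its value at the same month in the previous year.
    """
    f0 = 1.0
    f1 = np.sin(2 * np.pi * k / 12)      # illustrative seasonal term
    f2 = np.cos(2 * np.pi * k / 12)      # illustrative seasonal term
    f3 = f3_tr[k - 12 - 1]               # same month, previous year (valid
                                         # for the 12 test months)
    return np.array([f0, f1, f2, f3])
```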

Fig. 2

Three methods of subsetting the training data for computing the covariance and variance terms when predicting in time. Each rectangle represents the data matrix \(\varvec{Y}\) with months in the rows and stations in the columns. The test data to predict (at a future month) is indicated in red. Moving from left to right across the columns, the stations are assumed to be ordered in increasing distance from the test station. The first method subsets the most recent months across all stations. The second method subsets the test and nearby stations for all months. The third method subsets the most recent months for the test station

For subsetting in time, we selected the most recent l months across all N stations, i.e., the last l rows of \(\varvec{Y}^{\text {tr}}\). Specifically, we will set the vectorised subsetted training data to be the Nl-vector \(\varvec{y}^{\text {tr}}=((\varvec{y}_{T^{\text {tr}}-l+1}^{\text {tr}})^T,\ldots ,(\varvec{y}_{T^{\text {tr}}}^{\text {tr}})^T)^T\), where \(\varvec{y}_t^{\text {tr}}\) denotes the transpose of the tth row of \(\varvec{Y}^{\text {tr}}\). Since the same training data is used for predicting each station in the test data, we can efficiently calculate the predictions across all stations in one step. Letting \(\varvec{y}^{\text {tst}}_k\) denote the kth row of \(\varvec{Y}^{\text {tst}}\), the improved predictions for \(\varvec{y}^{\text {tst}}_k\) are given by

$$\begin{aligned} \hat{\varvec{y}}^{\text {tst}}_{k} = {{\hat{E}}}(\varvec{y}^{\text {tst}}_{k}) + \widehat{\text {Cov}}(\varvec{y}^{\text {tst}}_{k},\varvec{y}^{\text {tr}})\widehat{\text {Var}}(\varvec{y}^{\text {tr}})^{-1}\left( \varvec{y}^{\text {tr}}-{{\hat{E}}}(\varvec{y}^{\text {tr}})\right) , \end{aligned}$$
(16)

where

$$\begin{aligned} {{\hat{E}}}(\varvec{y}^{\text {tst}}_{k})= & {} \sum _{j=0}^J\hat{\varvec{\beta }}_jf_j(k), \\ \widehat{\text {Cov}}(\varvec{y}^{\text {tst}}_{k},\varvec{y}^{\text {tr}})= & {} \begin{bmatrix} \widehat{\text {Cov}}(\varvec{y}^{\text {tst}}_{k}, \varvec{y}_{T^{\text {tr}}-l+1}^{\text {tr}})&\dots&\widehat{\text {Cov}}(\varvec{y}^{\text {tst}}_{k}, \varvec{y}_{T^{\text {tr}}}^{\text {tr}}) \end{bmatrix} \quad \hbox {and} \\ \widehat{\text {Var}}(\varvec{y}^{\text {tr}})= & {} \begin{bmatrix} \widehat{\text {Cov}}(\varvec{y}^{\text {tr}}_{T^{\text {tr}}-l+1},\varvec{y}_{T^{\text {tr}}-l+1}^{\text {tr}}) &{} \dots &{} \widehat{\text {Cov}}(\varvec{y}^{\text {tr}}_{T^{\text {tr}}-l+1},\varvec{y}_{T^{\text {tr}}}^{\text {tr}})\\ \vdots &{} \ddots &{} \vdots \\ \widehat{\text {Cov}}(\varvec{y}^{\text {tr}}_{T^{\text {tr}}},\varvec{y}_{T^{\text {tr}}-l+1}^{\text {tr}}) &{} \dots &{} \widehat{\text {Cov}}(\varvec{y}^{\text {tr}}_{T^{\text {tr}}},\varvec{y}_{T^{\text {tr}}}^{\text {tr}})\\ \end{bmatrix}. \end{aligned}$$

The individual covariance terms are calculated using (6). Using (16), predictions for the test data were produced for \(l\in \{12,24,36,48,60\}\). The RMSE for each value of l is displayed (green line) in Fig. 3. We see that there is a small improvement in RMSE compared to predictions using (15). Further, subsetting the most recent four years (48 months) of the training data seems to be optimal, resulting in an RMSE of 0.9113.
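Putting the pieces together, a sketch of the prediction step (16) under this time-subsetting is given below; the covariance blocks are evaluated from (6), and the basis-function values and fitted means are assumed to have been extended to cover the future month k.

```python
import numpy as np

def predict_time_subset(k, l, Y_tr, F, mu_hat, Sigma_beta, Sigma_e):
    """Improved predictions for all N stations at future month k, as in (16),
    using only the most recent l training months.

    Y_tr   : (T_tr, N) training data matrix
    F      : basis-function values, extended to cover month k (rows = months)
    mu_hat : fitted means, also extended to cover month k     (rows = months)
    Time indices are 0-based here.
    """
    def cov_block(t1, t2):               # N x N covariance block from (6)
        C = sum(F[t1, j] * F[t2, j] * Sigma_beta[j] for j in range(F.shape[1]))
        return C + Sigma_e if t1 == t2 else C

    T_tr, N = Y_tr.shape
    recent = np.arange(T_tr - l, T_tr)
    cov_row = np.hstack([cov_block(k, t) for t in recent])            # N x Nl
    var_tr = np.block([[cov_block(t1, t2) for t2 in recent] for t1 in recent])
    resid = np.concatenate([Y_tr[t] - mu_hat[t] for t in recent])     # length Nl
    return mu_hat[k] + cov_row @ np.linalg.solve(var_tr, resid)
```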

For subsetting in space, for a given station, we selected all months from a set of neighbouring stations. That is, we subset \(\varvec{Y}^{\text {tr}}\) by selecting the columns corresponding to the set of neighbouring stations. Since the subsets vary for each station, predictions need to be produced separately for each station. For the ith station \(\varvec{s}_i\), let \(S^m_i=\{i_1,\ldots ,i_m\}\) denote the set of indices for the m nearest stations by distance (including the ith station) and set the vectorised subsetted training data to be the \(mT^{\text {tr}}\)-vector \(\varvec{y}^{\text {tr}}=((\varvec{y}^{\text {tr}}_{\varvec{s}_{i_1}})^T,\ldots ,(\varvec{y}^{\text {tr}}_{\varvec{s}_{i_m}})^T)^T\), where \(\varvec{y}_{\varvec{s}_n}^{\text {tr}}\) denotes the nth column of \(\varvec{Y}^{\text {tr}}\). Predictions for all months in the test data can then be efficiently calculated in one step. Letting \(\varvec{y}^{\text {tst}}_{\varvec{s}_i}\) denote the ith column of \(\varvec{Y}^{\text {tst}}\), the improved predictions are given by

$$\begin{aligned} \hat{\varvec{y}}^{\text {tst}}_{\varvec{s}_i} = {{\hat{E}}}(\varvec{y}^{\text {tst}}_{\varvec{s}_i}) + \widehat{\text {Cov}}(\varvec{y}^{\text {tst}}_{\varvec{s}_i},\varvec{y}^{\text {tr}})\widehat{\text {Var}}(\varvec{y}^{\text {tr}})^{-1}\left( \varvec{y}^{\text {tr}}-{{\hat{E}}}(\varvec{y}^{\text {tr}})\right) , \end{aligned}$$
(17)

where

$$\begin{aligned} {{\hat{E}}}(\varvec{y}^{\text {tst}}_{\varvec{s}_i})= & {} \sum _{j=0}^J{{\hat{\beta }}}_j(\varvec{s}_i)\varvec{f}_j^{T^{\text {tst}}}, \\ \widehat{\text {Cov}}(\varvec{y}^{\text {tst}}_{\varvec{s}_i},\varvec{y}^{\text {tr}})= & {} \begin{bmatrix} \widehat{\text {Cov}}(\varvec{y}^{\text {tst}}_{\varvec{s}_i}, \varvec{y}_{\varvec{s}_{i_1}}^{\text {tr}})&\dots&\widehat{\text {Cov}}(\varvec{y}^{\text {tst}}_{\varvec{s}_i}, \varvec{y}_{\varvec{s}_{i_m}}^{\text {tr}}) \end{bmatrix}\quad \hbox {and} \\ \widehat{\text {Var}}(\varvec{y}^{\text {tr}})= & {} \begin{bmatrix} \widehat{\text {Cov}}(\varvec{y}^{\text {tr}}_{\varvec{s}_{i_1}},\varvec{y}_{\varvec{s}_{i_1}}^{\text {tr}}) &{} \dots &{} \widehat{\text {Cov}}(\varvec{y}^{\text {tr}}_{\varvec{s}_{i_1}},\varvec{y}_{\varvec{s}_{i_m}}^{\text {tr}})\\ \vdots &{} \ddots &{} \vdots \\ \widehat{\text {Cov}}(\varvec{y}^{\text {tr}}_{\varvec{s}_{i_m}},\varvec{y}_{\varvec{s}_{i_1}}^{\text {tr}}) &{} \dots &{} \widehat{\text {Cov}}(\varvec{y}^{\text {tr}}_{\varvec{s}_{i_m}},\varvec{y}_{\varvec{s}_{i_m}}^{\text {tr}})\\ \end{bmatrix}. \end{aligned}$$

The individual covariance terms are calculated using (8) and (9). Using (17), predictions for the test data were produced for \(m\in \{1,2,3,4,5\}\). The RMSE for each value of m is displayed (blue line) in Fig. 3. The RMSE was mostly unchanged across different values of m, with the minimum of 0.9190 being achieved when \(m=1\). The RMSEs were also quite similar to using (15) for predictions.

We can draw some insights from the results of these two methods of subsetting the training data. In terms of subsetting in time, the greatest gains in prediction accuracy occur when selecting a recent set of months (e.g., 4 years) rather than all months in the training data. In terms of subsetting in space, there does not appear to be much gain from including any neighbouring stations in addition to the station itself. Combining these observations, the optimal subset of the training data to use when predicting a given station’s test data may simply be a selection of recent months from only the station itself.

Further to the above, for a given station, we subsetted the training data in both time and space by selecting the most recent \(l=48\) months from the given station. Specifically, for the ith station \(\varvec{s}_i\), we set the vectorised subsetted training data to be the l-vector \(\varvec{y}^{\text {tr}}=(Y^{\text {tr}}(\varvec{s}_i,T^{\text {tr}}-l+1),\ldots ,Y^{\text {tr}}(\varvec{s}_i,T^{\text {tr}}))^T\). We can then use (17) for prediction, with \(\widehat{\text {Cov}}(\varvec{y}^{\text {tst}}_{\varvec{s}_i},\varvec{y}^{\text {tr}})\) and \(\widehat{\text {Var}}(\varvec{y}^{\text {tr}})\) now calculated using (9). Using this method for subsetting the training data resulted in an RMSE of 0.9103 (red line in Fig. 3), which was the lowest among all subsetting methods.
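A sketch of this combined subsetting is given below; because the covariances come from (9) with \(i_1=i_2=i\), only the scalars \(\varvec{\Sigma }_{\varvec{\theta }_{\varvec{\beta }_j}}[i,i]\) and \(\varvec{\Sigma }_{\varvec{\theta }_{\varvec{e}}}[i,i]\) are needed, which makes the calculation very cheap.

```python
import numpy as np

def predict_time_own_station(i, k, l, Y_tr, mu_hat, F, Sigma_beta, Sigma_e):
    """Improved prediction at station i and future month k using only that
    station's most recent l training months (third method in Fig. 2).

    Y_tr, mu_hat : (months x N) arrays, mu_hat extended to cover month k
    F            : (months x (J+1)) basis-function values, extended to month k
    Time indices are 0-based here.
    """
    T_tr = Y_tr.shape[0]
    recent = np.arange(T_tr - l, T_tr)
    Fr = F[recent]                                          # l x (J+1)
    sb = np.array([S[i, i] for S in Sigma_beta])            # Sigma_beta_j[i, i]
    cov_star = (F[k] * sb) @ Fr.T                           # length-l row of (9)
    var_tr = (Fr * sb) @ Fr.T + Sigma_e[i, i] * np.eye(l)   # l x l matrix of (9)
    resid = Y_tr[recent, i] - mu_hat[recent, i]
    return mu_hat[k, i] + cov_star @ np.linalg.solve(var_tr, resid)
```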

Fig. 3

RMSE over the test data for prediction in time for different methods of subsetting the training data. The green, blue and red lines denote subsetting methods 1, 2, and 3, respectively, of Fig. 2. The black line (which is just above the blue line) denotes predictions using (15)

3.2 Prediction in space

Prediction in space refers to the problem of predicting the data value at an unobserved spatial location at observed time points. The problem can effectively be thought of as spatial interpolation. Specifically, we want to predict \(Y^{\text {tst}}=Y(\varvec{s}_i,t)\), for \(t=1,\ldots ,T\), at some unobserved spatial location \(\varvec{s}_i\). We will use leave-one-station-out cross-validation to evaluate the performance of our improved prediction in space. That is, each station will serve as a “test station”. Therefore, for each \(i=1,\ldots ,N\), the test data \(\varvec{Y}^{\text {tst}}=\varvec{y}_{\varvec{s}_i}^{\text {tst}}\) will be the T-vector that is the ith column of \(\varvec{Y}\) and the training data \(\varvec{Y}^{\text {tr}}\) will be the \(T\times (N-1)\) matrix consisting of the remaining columns of \(\varvec{Y}\).

The improved prediction \({{\hat{Y}}}(\varvec{s}_i,t)\) is calculated according to (14), with

$$\begin{aligned} {{\hat{E}}}(Y(\varvec{s}_i,t)) = \sum _{j=0}^J{{\hat{\beta }}}_j(\varvec{s}_i)f_j(t) = \sum _{j=0}^J\varvec{x}_{j,i}^T\hat{\varvec{\alpha }}_j f_j(t), \end{aligned}$$
(18)

where \(\varvec{x}_{j,i}\) denotes the vector of spatial variables for location \(\varvec{s}_i\). Note that since \(\varvec{s}_i\) is now an unobserved location, the estimates \(\{{{\hat{\beta }}}_j(\varvec{s}_i)\}_{j=0}^J\) are not available directly and need to be calculated according to (10) to (13). Predictions using (18) resulted in an RMSE of 1.0804 on the test data. As a naive comparison, when using the nearest station's observed data values as the predictions for each test station, the RMSE was 0.6478. The fact that the naive predictions outperform predictions using (18) indicates that there is potential for the improved predictions of (14) to substantially increase predictive accuracy. For the improved predictions, we will again investigate using reduced subsets of the training data, with the details described in Fig. 4.

Fig. 4

Three methods of subsetting the training data for computing the covariance and variance terms when predicting in space. Similar to Fig. 2, each rectangle represents the data matrix \(\varvec{Y}\) and the test data to predict (at an unobserved station) is indicated in red. The first method subsets the nearby stations for all months. The second method subsets a window of months around the test month for all stations. The third method subsets nearby stations for the test month

To subset in space, for each test station, we selected all months from a set of neighbouring stations. That is, for the ith test station \(\varvec{s}_i\), let \(S_i^m=\{i_1,\ldots ,i_m\}\) denote the set of indices for the m nearest stations by distance in the training data. The vectorised subsetted training data is then set to be the mT-vector \(\varvec{y}^{\text {tr}}=((\varvec{y}^{\text {tr}}_{\varvec{s}_{i_1}})^T,\ldots ,(\varvec{y}^{\text {tr}}_{\varvec{s}_{i_m}})^T)^T\), where \(\varvec{y}_{\varvec{s}_n}^{\text {tr}}\) denotes the nth column of \(\varvec{Y}^{\text {tr}}\). Improved predictions over all months in the test station, i.e., \(\varvec{y}_{\varvec{s}_i}^{\text {tst}}\), are calculated according to (17), but now with \({{\hat{E}}}(\varvec{y}_{\varvec{s}_i}^{\text {tst}})=\sum _{j=0}^J\varvec{x}_{j,i}^T\hat{\varvec{\alpha }}_j\varvec{f}_j\). Predictions for the test data were produced for \(m\in \{1,2,3,4,5\}\). The RMSE for each value of m is displayed (dark blue line) in Fig. 5. Even for \(m=1\), we see that the RMSE is lower than both the predictions using only (18) and the naive predictions. As we include more neighbouring stations, the RMSE continues to decrease, reaching a minimum of 0.5588.

To subset in time, for each month in the test station, we selected all stations for a window of months surrounding this month of interest. Since we are predicting at observed time points, we are able to consider data from previous, current and future months, relative to the month of interest, for subsetting the training data. Specifically, for month t, let \(T_t^l=\{z\in {{\mathbb {Z}}}:\max \{t-l,1\}\le z\le \min \{t+l,T\}\}\), where \(l\ll T\), denote the set of indices corresponding to the window of \(|T_t^l|=\min \{t+l,T\}-\max \{t-l,1\}+1\) months centered at month t. Note that the window size ranges from a maximum of \(2l+1\) months (when \(l+1\le t\le T-l\)) to a minimum of \(l+1\) months (when \(t=1\) or \(t=T\)). The vectorised subsetted training data is therefore set to be the \((N-1)|T_t^l|\)-vector \(\varvec{y}^{\text {tr}}=((\varvec{y}_{\max \{t-l,1\}}^{\text {tr}})^T,\ldots ,(\varvec{y}_{\min \{t+l,T\}}^{\text {tr}})^T)^T\). Since the vectorised subsetted training data will depend on the month t, for the ith test station \(\varvec{s}_i\) we calculated predictions separately for each month t using (14), where now \(\widehat{\text {Cov}}(Y(\varvec{s}_i,t),\varvec{y}^{\text {tr}})=\begin{bmatrix} \widehat{\text {Cov}}(Y(\varvec{s}_i,t),\varvec{y}_{\max \{t-l,1\}}^{\text {tr}})&\dots&\widehat{\text {Cov}}(Y(\varvec{s}_i,t),\varvec{y}_{\min \{t+l,T\}}^{\text {tr}}) \end{bmatrix}\) and

$$\begin{aligned} \widehat{\text {Var}}(\varvec{y}^{\text {tr}}) = \begin{bmatrix} \widehat{\text {Cov}}(\varvec{y}^{\text {tr}}_{\max \{t-l,1\}},\varvec{y}_{\max \{t-l,1\}}^{\text {tr}}) &{} \dots &{} \widehat{\text {Cov}}(\varvec{y}^{\text {tr}}_{\max \{t-l,1\}},\varvec{y}_{\min \{t+l,T\}}^{\text {tr}})\\ \vdots &{} \ddots &{} \vdots \\ \widehat{\text {Cov}}(\varvec{y}^{\text {tr}}_{\min \{t+l,T\}},\varvec{y}_{\max \{t-l,1\}}^{\text {tr}}) &{} \dots &{} \widehat{\text {Cov}}(\varvec{y}^{\text {tr}}_{\min \{t+l,T\}},\varvec{y}_{\min \{t+l,T\}}^{\text {tr}})\\ \end{bmatrix}. \end{aligned}$$

The individual covariance terms were calculated using (6) and (7). Predictions for the test data were produced for \(l\in \{0,1,2,3,4\}\). The RMSE for each value of l is displayed (green line) in Fig. 5. The minimum RMSE of approximately 0.5526 was obtained for \(l=0\). This was similar to the minimum RMSE achieved when subsetting in space.

These results indicate that, when subsetting the training data in space, including more neighbouring stations improves prediction accuracy and, when subsetting in time, only training data from the month of interest are required. To explore this further, we subsetted the training data in both time and space. In detail, for each month in the test station, we selected a set of neighbouring stations only for this month of interest. Similar to the setup used for subsetting in space, for the ith test station \(\varvec{s}_i\), let \(S_i^m=\{i_1,\ldots ,i_m\}\) denote the set of indices for the m nearest stations by distance in the training data. For month t, the vectorised subsetted training data is set to be the m-vector \(\varvec{y}^{\text {tr}}=(Y^{\text {tr}}(\varvec{s}_{i_1},t),\ldots ,Y^{\text {tr}}(\varvec{s}_{i_m},t))^T\). For the ith test station \(\varvec{s}_i\), predictions were calculated separately for each month t again using (14), noting that the calculations of \(\widehat{\text {Cov}}(Y(\varvec{s}_i,t),\varvec{y}^{\text {tr}})\) and \(\widehat{\text {Var}}(\varvec{y}^{\text {tr}})\) will now be somewhat simplified compared to when subsetting in time. Predictions for the test data were produced for \(m\in \{1,5,10,15,\ldots ,75\}\). The RMSE for each value of m is displayed (red line) in Fig. 5. While the minimum RMSE was achieved at \(m=60\) (0.5526), the RMSE was approximately 0.553 from \(m=15\) onwards. Therefore, for each month in the test station, near-optimal prediction accuracy can be achieved by subsetting on only 15 data values in the training data.
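For concreteness, a minimal sketch of this third subsetting scheme (the m nearest stations at the month of interest only) is given below. The fitted spatial covariance matrices are assumed to have been evaluated over all N stations, including the held-out one, and `dist` is a hypothetical matrix of pairwise station distances.

```python
import numpy as np

def predict_space_month(i, t, m, Y, mu_hat, F, Sigma_beta, Sigma_e, dist):
    """Improved prediction at held-out station i and month t using only the
    m nearest stations observed at that month.

    Y, mu_hat  : (T, N) data matrix and fitted means
    F          : (T, J+1) basis-function values
    dist       : (N, N) matrix of pairwise station distances (hypothetical input)
    """
    nbrs = np.argsort(dist[i])
    nbrs = nbrs[nbrs != i][:m]                    # m nearest other stations
    f2 = F[t] ** 2                                # f_j(t)^2 for j = 0, ..., J
    J1 = F.shape[1]
    cov_star = sum(f2[j] * Sigma_beta[j][i, nbrs] for j in range(J1))
    cov_star = cov_star + Sigma_e[i, nbrs]        # same-month residual term
    var_tr = sum(f2[j] * Sigma_beta[j][np.ix_(nbrs, nbrs)] for j in range(J1))
    var_tr = var_tr + Sigma_e[np.ix_(nbrs, nbrs)]
    resid = Y[t, nbrs] - mu_hat[t, nbrs]
    return mu_hat[t, i] + cov_star @ np.linalg.solve(var_tr, resid)
```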

Fig. 5

RMSE via leave-one-station-out cross-validation for prediction in space for different methods of subsetting the training data. The dark blue, green and red lines denote subsetting methods 1, 2 and 3, respectively, of Fig. 4. The black line denotes predictions using (18) and the light blue line denotes the naive prediction. The right plot is a zoomed view of the lower region of the left plot

3.3 Prediction in space and time

For spatio-temporal data, prediction in both space and time is the most challenging problem but is often of most interest. It involves predicting the data value \(Y^{\text {tst}}=Y(\varvec{s}_i,k)\) for some future time point k at an unobserved spatial location \(\varvec{s}_i\). For each \(i=1,\ldots ,N\), the test data \(\varvec{Y}^{\text {tst}}\) will be the vector of length \(T^{\text {tst}}=12\) corresponding to the last 12 rows (months) of the ith column (station) of \(\varvec{Y}\) and the training data \(\varvec{Y}^{\text {tr}}\) will be the \(T^{\text {tr}}\times (N-1)\) matrix consisting of the first \(T^{\text {tr}}=974\) rows of \(\varvec{Y}\) with the ith column removed.

The improved prediction \({{\hat{Y}}}(\varvec{s}_i,k)\) is calculated using (14), where now

$$\begin{aligned} {{\hat{E}}}(Y(\varvec{s}_i,k)) = \sum _{j=0}^J{{\hat{\beta }}}_j(\varvec{s}_i)f_j(k) = \sum _{j=0}^J\varvec{x}_{j,i}^T\hat{\varvec{\alpha }}_j f_j(k). \end{aligned}$$
(19)

Note that the estimates \(\{{{\hat{\beta }}}_j(\varvec{s}_i)\}_{j=0}^J\) need to be calculated according to (10) to (13), as was done when predicting in space, and the values of the basis functions \(f_j(t)\) at the future time point k are calculated similarly to when predicting in time. Given that we are using the same test data used when predicting in time, we will follow a similar method for subsetting the training data. We will use the RMSE obtained when using (19) for prediction (0.9555) and the minimum RMSE achieved when predicting in time (0.9103) as baseline comparisons. The subsetting methods are described in Fig. 6.

Fig. 6

Three methods of subsetting the training data for computing the covariance and variance terms when predicting in space and time. The subsetting methods are the same as those used for prediction in time (Fig. 2), with the only difference being that data in the test station are no longer included in the training data. This is because we are now predicting at a future month at an unobserved station

For subsetting in time, we selected the most recent l months across the \(N-1\) stations in the training data. Hence we set the vectorised subsetted training data to be the \((N-1)l\)-vector \(\varvec{y}^{\text {tr}}=((\varvec{y}_{T^{\text {tr}}-l+1}^{\text {tr}})^T,\ldots ,(\varvec{y}_{T^{\text {tr}}}^{\text {tr}})^T)^T\). For each \(i=1,\ldots ,N\), the improved predictions for all months in the test station, i.e., \(\varvec{y}^{\text {tst}}_{\varvec{s}_i}\), are calculated according to (17), where \({{\hat{E}}}(\varvec{y}^{\text {tst}}_{\varvec{s}_i})=\sum _{j=0}^J\varvec{x}_{j,i}^T\hat{\varvec{\alpha }}_j\varvec{f}_j^{T^\text {tst}}\),

$$\begin{aligned} \widehat{\text {Cov}}(\varvec{y}^{\text {tst}}_{\varvec{s}_i},\varvec{y}^{\text {tr}}) = \begin{bmatrix} \widehat{\text {Cov}}(Y^{\text {tst}}(\varvec{s}_i,1), \varvec{y}_{T^{\text {tr}}-l+1}^{\text {tr}}) &{} \dots &{}\widehat{\text {Cov}}(Y^{\text {tst}}(\varvec{s}_i,1), \varvec{y}_{T^{\text {tr}}}^{\text {tr}})\\ \vdots &{} \ddots &{} \vdots \\ \widehat{\text {Cov}}(Y^{\text {tst}}(\varvec{s}_i,T^{\text {tst}}), \varvec{y}_{T^{\text {tr}}-l+1}^{\text {tr}}) &{} \dots &{}\widehat{\text {Cov}}(Y^{\text {tst}}(\varvec{s}_i,T^{\text {tst}}), \varvec{y}_{T^{\text {tr}}}^{\text {tr}}) \end{bmatrix} \end{aligned}$$

and \(\widehat{\text {Var}}(\varvec{y}^{\text {tr}})\) is calculated as was done in (16). The individual covariance terms are calculated using (6) and (7). Predictions for the test data were produced for \(l\in \{12,24,36,48,60\}\). The RMSE for each value of l is displayed (green line) in Fig. 7. Similar to what was observed when predicting in time, the minimum RMSE of 0.9515 was achieved when \(l=48\).

For subsetting in space, for a given station, we selected all months from a set of neighbouring stations. For the ith station \(\varvec{s}_i\), let \(S^m_i=\{i_1,\ldots ,i_m\}\) denote the set of indices for the m nearest stations by distance in the training data. The vectorised subsetted training data is then the \(mT^{\text {tr}}\)-vector \(\varvec{y}^{\text {tr}}=((\varvec{y}^{\text {tr}}_{\varvec{s}_{i_1}})^T,\ldots ,(\varvec{y}^{\text {tr}}_{\varvec{s}_{i_m}})^T)^T\). For each \(i=1,\ldots ,N\), improved predictions for all months in the test station are again computed using (17), but now with \(\widehat{\text {Cov}}(\varvec{y}^{\text {tst}}_{\varvec{s}_i},\varvec{y}^{\text {tr}})\) and \(\widehat{\text {Var}}(\varvec{y}^{\text {tr}})\) calculated in a similar manner to (17). Predictions for the test data were produced for \(m\in \{1,2,3,4,5\}\). The RMSE for each value of m is displayed (blue line) in Fig. 7. Again, as was observed when predicting in time, the RMSE remained mostly unchanged across different values of m, with the minimum (0.9554) being achieved when \(m=1\).

Fig. 7

RMSE over the test data for prediction in both space and time for different methods of subsetting the training data. The green, blue and red lines denote subsetting methods 1, 2 and 3, respectively, of Fig. 6. The black line (which is almost on top of the dark blue line) denotes predictions using (19) and the light blue line denotes the best predictions in time of Sect. 3.1

As a final comparison, we subsetted the training data in both time and space. Based on the results for subsetting in time and subsetting in space, for a given station, we selected the most recent \(l=48\) months for the nearest station in the training data (i.e., \(\varvec{s}_{i_1}\)). The vectorised subsetted training data is then the l-vector \(\varvec{y}^{\text {tr}}=(Y^{\text {tr}}(\varvec{s}_{i_1},T^{\text {tr}}-l+1),\ldots ,Y^{\text {tr}}(\varvec{s}_{i_1},T^{\text {tr}}))^T\). For each \(i=1,\ldots ,N\), (17) is again used to produce improved predictions for all months in the test station, where \(\widehat{\text {Cov}}(\varvec{y}^{\text {tst}}_{\varvec{s}_i},\varvec{y}^{\text {tr}})\) and \(\widehat{\text {Var}}(\varvec{y}^{\text {tr}})\) are now calculated using (9). The RMSE was 0.9537 and is displayed in red in Fig. 7.

Comparing these different methods of subsetting the training data, we see that the greatest prediction accuracy on the test data is attained when we subset on all stations for the most recent 48 months in the training data. However, using only the nearest station for the most recent 48 months produces a similar RMSE, indicating that it may not be necessary to subset on all stations in the training data. We note that regardless of the method used for subsetting the training data, the RMSE on the test data when predicting in both space and time is greater than the RMSE when predicting only in time. This is expected since, for each station in the test data, when predicting only in time we have the advantage of being able to use any predictive power contained in past data observed at this station.

4 Discussion

Predicting a new data value from a parametric model typically involves using an estimate of the expectation of the new data value as the predicted value. We have applied a prediction framework for spatio-temporal models that improves on using only the expectation of the value to be predicted. A key advantage of this framework is that it can be applied to many parametric models for spatio-temporal data. The improved predictions were based on minimising the mean squared prediction error. They have a strong optimality property under normality and a weaker optimality property when normality does not hold. As such, the framework has the potential to improve predictions over simply using the estimated mean regardless of the distributional assumptions on the data.

We applied the framework for improving predictions to a hierarchical model for monthly rainfall data in the Murray-Darling Basin. We focused on all the main types of prediction for spatio-temporal data, namely, prediction only in time, prediction only in space and prediction in both time and space. In all situations, the framework improved prediction accuracy compared to using only the expectation of the value to be predicted. In particular, the improvement in predictive accuracy was quite large when predicting only in space. This confirms the value of kriging for spatial prediction.

One potential limitation of the framework is that the calculations of the improved predictions may be computationally intensive. This is mainly due to the need to calculate or invert large covariance matrices when generating the predictions. However, we have demonstrated that large gains in predictive accuracy can still be achieved by using only a reduced subset of the data for computing these matrices. Specifically, for our model, we obtained good results for prediction in

  • Time using only the previous 48 months of data from the station of interest;

  • Space using only data from the 15 nearest neighbouring stations at the month of interest;

  • Space and time using data from all stations for only the most recent 48 months, although there is some evidence that fewer stations can be used.

The results for prediction in time are interesting because they are similar to what we expect when using an autoregressive time series model (for which we only need a small fixed number of lagged values for prediction) even though our model captures temporal effects through the basis functions \(\left\{ f_j(t)\right\} _{j=0}^J\) rather than through an autoregressive model. The results for prediction in space are similar to what we expect from kriging in spatial models. We believe that these results show that the framework provides a computationally feasible approach for improving predictions that should be applied to a wide variety of models for spatio-temporal data.

We have presented and evaluated the predictions made in this paper as point predictions. The assessment of uncertainty for these predictions using analytical approximations for the mean squared prediction errors or the bootstrap is quite complicated. A useful recent paper on the topic in the spatial context (Thiart et al. 2014) provides some optimism that bootstrap methods may be useful in our context. This is attractive because we have already developed a block bootstrap method to assess the uncertainty in the parameter estimates. The application of the method to estimating prediction uncertainty requires further investigation.