1 Introduction

In the last decade, there has been increasing interest in modeling spatio-temporal data that result from dynamic processes in constant evolution in both space and time. Geostatistics has been used to deal with such spatio-temporal processes by providing covariance models to analyze the dependence structure. Since the eighties, a wealth of papers have provided a variety of spatio-temporal models that have been used in many scientific areas, such as hydrology, environmental sciences, geology, astronomy, neuroscience, ecology, atmospheric sciences, oceanography, and economics; this list is indeed virtually endless [see, for instance, Eynon and Switzer (1983), Bilonick (1985), Oehlert (1993), among others]. Usually, the proposed models exhibit a hierarchical structure incorporating spatial or spatio-temporal dependencies. Indeed, approaches based on hierarchical spatio-temporal structures have been used to achieve flexible models that capture the dynamic behavior of the data. This was the case in Brown et al. (1994), who applied a hierarchical model to a relatively low-dimensional space–time air pollution problem. Similarly, Handcock and Wallis (1994) used a Bayesian kriging approach for space–time modeling of meteorological fields. Waller et al. (1997) employed hierarchical Bayesian space–time models for mapping disease rates. Wikle et al. (1998) used this class of models in the analysis of monthly averaged maximum temperature data distributed in space and time, and Hughes et al. (1999) used hidden Markov models with unobserved weather states to model space–time atmospheric precipitation. We can also find applications in environmental pollution problems, where, for example, Ippoliti (2001) analyzed the levels of sulfur dioxide in Milan (Italy), providing a spatio-temporal state-space representation. Fasso et al. (2007) combined the state-space method with calibration techniques and applied them to fine particulate matter (PM\(_{10}\)) data in the spatio-temporal dimension. Cameletti et al. (2013) considered a hierarchical spatio-temporal model for particulate matter concentrations; their proposal involves a Gaussian field affected by measurement error, and a state process characterized by a first-order autoregressive dynamic model with spatially correlated innovations. Another interesting spatio-temporal application, for the interpolation of daily rainfall data using state-space models, has been proposed by Militino et al. (2015). For an overview of hierarchical dynamical spatio-temporal models, the recent book by Cressie and Wikle (2011) provides an excellent starting point for researchers in this area.

In these modeling exercises, the Kalman filter algorithm has proved to be a powerful tool for the statistical treatment of state-space models, providing estimation of the parameters (given in the state vector) and prediction of unobserved values at a specific location. For instance, Huang and Cressie (1996) and Wikle (2003) developed empirical Bayesian space–time Kalman filter models for the investigation of snow water equivalent and monthly precipitation. Mardia et al. (1998) considered a mixed approach between the Kalman filter algorithm and kriging methodology (named the kriged Kalman filter), in which the state equation incorporates different forms of temporal dynamics to model space–time interactions. Wikle and Cressie (1999) presented an approach to space–time prediction that achieves dimension reduction through a spatio-temporal Kalman filter. Xu and Wikle (2007) proposed a spatio-temporal dynamic model formulation with restricted parameter matrices based on prior scientific knowledge, and developed a general expectation–maximization (GEM) algorithm to carry out the estimation. Stroud et al. (2010) applied a dynamic state-space model to a sequence of SeaWiFS satellite images of Lake Michigan, where a large amount of sediment was observed after a major storm. In that study, the authors implemented a variant of the Kalman filter, called the ensemble Kalman filter, which makes it possible to deal with the nonlinearities and high dimensionality inherent in satellite images; they were able to provide maps of sediment concentrations with uncertainties in space and time. To deal with forecasting of spatio-temporal processes, Zes (2014) used the state-space system and the time-varying parameter least squares autoregressive system, with their respective solving algorithms, the Kalman filter and autoregressive adaptive least squares (ALS). More recently, Bocquet et al. (2015) discussed the methods available for data assimilation in atmospheric models, including the ensemble Kalman filter. A common feature of the above-mentioned papers is that the unobserved state vector is responsible for capturing the temporal dependence through a Markovian temporal evolution, with autoregressive or vector autoregressive models. This means that the proposed approaches belong to the class of short-memory models for time series data.

Our approach considers a more general framework. Specifically, we capture the temporal dependence of both short- and long-memory processes, while also modeling the spatial dependence. We use the Kalman filter algorithm for estimation and prediction of spatio-temporal processes, but on the basis of a new updating scheme for the unobserved state vector, which differs from the above-mentioned proposals for the following reasons:

  • We propose the use of an infinite moving average expansion (MA\((\infty )\)) as a way of representing linear processes in spatio-temporal models.

  • Our proposal includes short- or long-memory models to capture the temporal dependence, such as ARMA(p, q) and ARFIMA(p, d, q) models.

  • Instead of directly calculating the likelihood of the spatio-temporal process, we propose an approximation to the likelihood function based on a truncated state-space representation. This reduces the memory required and alleviates the computational burden.

  • Finally, a methodology for the imputation of missing observations is also proposed.

The plan of the paper is the following. Section 2 discusses a class of spatio-temporal processes and their representation as MA(\(\infty \)) expansions. Section 3 presents the state-space models and the Kalman filter algorithm for estimating the parameters involved in the temporal dependence and the spatial structure. In Sect. 4, a simulation study shows the adequacy and good behavior of our proposal under a variety of practical scenarios. Section 5 applies our proposal to global total column ozone levels and to Irish wind speed data. The paper ends with some conclusions and a discussion.

2 A class of spatio-temporal processes

Consider the class of spatio-temporal processes given by the infinite moving average expansion,

$$\begin{aligned} Y_{t}(\mathbf{s})= & {} {{\varvec{M}}}_{t}(\mathbf{s}){\varvec{\beta }}+ \varepsilon _{t}(\mathbf{s}), \end{aligned}$$
(1)
$$\begin{aligned} \varepsilon _{t}(\mathbf{s})= & {} \sum _{j=0}^\infty \psi _j\eta _{t-j}(\mathbf{s}) , \end{aligned}$$
(2)

for \(t=1,\ldots ,T\), where \(\mathbf{s}\) represents a location in the spatial domain \(D\subset \mathbb {R}^{2}\), \({\varvec{\beta }}=[\beta _1,\ldots , \beta _{p}]^{\top }\) is a vector of parameters, \({{\varvec{M}}}_{t}\left( \textstyle \mathbf{s}\right) =[M_{t}^{(1)}(\mathbf{s}),\ldots , M_{t}^{(p)}(\mathbf{s})]\) is a p-dimensional vector of non-stochastic regressors, \(\{\psi _j\}\) is a sequence of coefficients satisfying \(\sum _{j=0}^\infty \psi _j^2<\infty \), and \(\{\eta _t(\mathbf{s})\}\) is a sequence of temporally independent and spatially stationary Gaussian processes, with \({{\mathrm{\mathbb {E}}}}(\eta _t(\mathbf{s}))=0 ~ \forall ~ \mathbf{s}\in D\), and covariance function

$$\begin{aligned} {{\mathrm{cov}}}(\eta _{t}(\mathbf{s}), \eta _{t}(\mathbf{s}'))=C^{\eta }(||\mathbf{s}-\mathbf{s}'||; {\varvec{\theta }}), \quad \mathbf{s}, \mathbf{s}' \in D. \end{aligned}$$
(3)

Here, \(C^{\eta }:[0, \infty ) \rightarrow \mathbb {R}\) is such that the composition \(C^{\eta } \circ ||\cdot || : \mathbb {R}^{2} \rightarrow \mathbb {R}\) is an isotropic covariance function (Daley and Porcu 2014), where \(||\cdot ||\) is the Euclidean distance, \(\circ \) denotes composition, and \({\varvec{\theta }}\) is a parameter vector. This representation is similar to the well-known MA\((\infty )\) decomposition for the error sequence \(\{\varepsilon _{t}(\mathbf{s})\}\). A stationary process with lag-h temporal covariance \(\kappa (h)\) is said to have short memory if \(\sum _{h=-\infty }^{\infty } \left| \kappa (h)\right| < \infty \), and in this case the process in (2) will be called a short-memory process. On the other hand, if \(\sum _{h=-\infty }^{\infty } \left| \kappa (h)\right| = \infty \), the process in Eq. (2) will be called a long-memory process. Another characterization is based directly on the MA\((\infty )\) decomposition of the process (2): the process \(\{\varepsilon _{t}(\mathbf{s})\}\) has short memory if \(\psi _j \sim \exp (-a j)\) for \(j\ge 1\), with a a positive constant, and long memory if \(\psi _j \sim j^{d-1}\) for some \(d\in (0,1/2)\) [see Palma et al. (2013)]. Here, \(\sim \) means that the ratio of the two sides tends to one. We now consider the spatial and temporal dependencies of the process \(Y_{t}(\mathbf{s})\). In the case of the MA\((\infty )\) decomposition for the spatio-temporal process defined by (1), the covariance across both space and time is given by

$$\begin{aligned} {{\mathrm{cov}}}\left( Y_{t}(\mathbf{s}), Y_{t'}(\mathbf{s}') \right)= & {} \sum _{i,j=0}^{\infty }\psi _{j}\psi _{i}{{\mathrm{cov}}}\left( \eta _{t-j}(\mathbf{s}),\eta _{t'-i}(\mathbf{s}')\right) \\= & {} C^{\eta }(\xi )\sum _{j=0}^{\infty } \psi _{j}\psi _{j+\left| t-t'\right| }=C^{\eta }(\xi )\kappa (h), \end{aligned}$$

where \(\xi =||\mathbf{s}-\mathbf{s}'||, \mathbf{s}, \mathbf{s}' \in D\), \(h=t-t', ~ t, t' \in \mathbb {Z_+}\) and \(\kappa (\cdot )\) is a temporal covariance function. Thus, the spatio-temporal covariance function of the process \(Y_{t}(\mathbf{s})\) can be written as the product of a purely spatial and a purely temporal covariance function. In such a case, we say that \(Y_{t}(\mathbf{s})\) has a separable spatio-temporal covariance function (Gneiting 2002). For a neater and self-contained exposition, we discuss some examples below.

Example 1

As an example of a short-memory process, we consider the regression model with autoregressive moving average errors, denoted by ARMA(p, q) and defined as

$$\begin{aligned} \varepsilon _{t}(\mathbf{s})= \sum _{j=1}^{p} \phi _j\varepsilon _{t-j}(\mathbf{s}) + \sum _{j=1}^{q} \theta _j \eta _{t-j}(\mathbf{s}) + \eta _{t}(\mathbf{s}), \quad \mathbf{s}\in D, \end{aligned}$$

for \(t=1,2,\ldots , T\). When \(p=q=1\), i.e., \(\varepsilon _{t}(\mathbf{s})= \phi \varepsilon _{t-1}(\mathbf{s}) + \theta \eta _{t-1}(\mathbf{s}) + \eta _{t}(\mathbf{s})\), we obtain a special case of the errors in (2) with coefficients \(\psi _j=\left( \phi +\theta \right) \phi ^{j-1}\) for \(j\ge 1\) and \(\psi _j=1\) for \(j=0\) in the MA\((\infty )\) process (hereafter, we assume that \(\psi _0=1\), unless specified otherwise). When \(q=0\), we have an autoregressive AR(p) process defined as \(\varepsilon _{t}(\mathbf{s})= \sum _{j=1}^{p} \phi _j \varepsilon _{t-j}(\mathbf{s}) + \eta _{t}(\mathbf{s}),\) for \(t=1,2,\ldots , T\). If \(p=1\), then \(\varepsilon _{t}(\mathbf{s})= \phi \varepsilon _{t-1}(\mathbf{s}) + \eta _{t}(\mathbf{s})\); this process is a special case of the errors in (2) with coefficients \(\psi _j=\phi ^j\) in the MA\((\infty )\) process. Straightforward calculations show that the temporal covariance function of an AR(1) process is \(\kappa (h)= \frac{\phi ^{h}}{1-\phi ^2}\) for \(h>0\); see Mikosch et al. (1995) for other MA\((\infty )\) representations of ARMA models.
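These weights can be checked numerically: base R provides the MA\((\infty )\) coefficients of an ARMA model via stats::ARMAtoMA, which uses the same sign convention as the model above. A minimal sketch, with illustrative parameter values:

# Numerical check of psi_j = (phi + theta) * phi^(j-1) for the ARMA(1, 1) case.
phi <- 0.45; theta <- 0.30
psi_r      <- ARMAtoMA(ar = phi, ma = theta, lag.max = 10)  # psi_1, ..., psi_10
psi_closed <- (phi + theta) * phi^(0:9)
all.equal(psi_r, psi_closed)                                # TRUE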

Example 2

Another example of the regression model (1) arises with stationary fractional noise (FN) errors in (2), whose infinite moving average coefficients are \(\psi _{j}=\frac{\varGamma (j+d)}{\varGamma (j+1)\varGamma (d)}\), where \(\varGamma (\cdot )\) is the Gamma function and d is the long-memory coefficient. The expected value is given by \({{\mathrm{\mathbb {E}}}}(Y_{t}(\mathbf{s}))= {{\varvec{M}}}_{t}(\mathbf{s}){\varvec{\beta }}\), while the temporal covariance function is \(\kappa (h)=\sigma ^2\frac{\varGamma (1-2d)\varGamma (h+d)}{\varGamma (1-d) \varGamma (d)\varGamma (h+1-d)}\), for \(h>0\), with \(d\in (0,1/2)\). A natural extension of the FN model is the stationary autoregressive fractionally integrated moving average ARFIMA(p, d, q) process, defined by \(\varPhi \left( B\right) \varepsilon _{t}(\mathbf{s})=\varTheta \left( B\right) \left( 1-B \right) ^{-d}\eta _{t}(\mathbf{s})\), for \(t=1,2,\ldots , T\), where B is the backward shift operator, \(\varPhi \left( B\right) =1+\phi _{1}B+\cdots +\phi _{p}B^{p}\) is an autoregressive polynomial, and \(\varTheta \left( B \right) =1+\theta _1B+\cdots +\theta _{q}B^{q}\) is a moving average polynomial. The infinite moving average coefficients \(\psi _{j}\) satisfy the following relation

$$\begin{aligned} \psi _0=1, \quad \psi _j= \pi _j(d) + \sum _{i=1}^{\min (j,p)} \phi _i \psi _{j-i} + \sum _{i=1}^{\min (j,q)} \theta _i \pi _{j-i}(d), \quad j \ge 1, \end{aligned}$$

where the weights \(\pi _j(d)\) are given by \(\pi _{j}(d)=\frac{\varGamma (j+d)}{\varGamma (j+1)\varGamma (d)}\), for \(0<d <1/2\); see Kokoszka and Taqqu (1995) for more details.
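A direct R transcription of this recursion may help fix ideas. The following is a sketch (the function name and arguments are ours), with the weights \(\pi _j(d)\) computed on the log-Gamma scale for numerical stability:

# MA(infinity) weights of a stationary ARFIMA(p, d, q), following the recursion
# above with psi_0 = 1 and pi_j(d) = Gamma(j + d) / (Gamma(j + 1) * Gamma(d)).
arfima_psi <- function(d, phi = numeric(0), theta = numeric(0), m = 50) {
  j  <- 0:m
  pj <- exp(lgamma(j + d) - lgamma(j + 1) - lgamma(d))  # pi_0(d), ..., pi_m(d)
  psi <- c(1, numeric(m))                               # psi_0 = 1
  for (k in seq_len(m)) {
    i_ar <- seq_len(min(k, length(phi)))                # i = 1, ..., min(k, p)
    i_ma <- seq_len(min(k, length(theta)))              # i = 1, ..., min(k, q)
    psi[k + 1] <- pj[k + 1] + sum(phi[i_ar] * psi[k + 1 - i_ar]) +
      sum(theta[i_ma] * pj[k + 1 - i_ma])
  }
  psi
}
# Fractional noise: arfima_psi(0.3) reproduces Gamma(j+d)/(Gamma(j+1) Gamma(d)).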

For the spatial covariance structure defined in (3), we consider the general class of Matérn covariance models (Matérn 1986) given by

$$\begin{aligned} C^{\eta }(\xi ; {\varvec{\theta }})=\frac{\sigma ^2}{2^{\nu -1}\varGamma {(\nu )}}\left( 2\sqrt{\nu } \rho \xi \right) ^{\nu } K_{\nu }\left( 2\sqrt{\nu } \rho \xi \right) , \quad \xi \ge 0, \quad {\varvec{\theta }}=(\sigma ^{2}, \rho , \nu )^{\top },\qquad \end{aligned}$$
(4)

where \(\rho >0\), \(\nu > 0\), \(\sigma ^2 >0\) and \(K_{\nu }\) is the modified Bessel function of the second kind of order \(\nu \). Known special cases will be shown in detail in Sect. 4.
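For reference, Eq. (4) can be evaluated directly in R through the base function besselK. A minimal sketch (the function name is ours):

# Matern covariance of Eq. (4); besselK is the modified Bessel function of the
# second kind. The limit as xi -> 0 equals sigma2.
matern_cov <- function(xi, sigma2, rho, nu) {
  out <- rep(sigma2, length(xi))
  pos <- xi > 0
  u <- 2 * sqrt(nu) * rho * xi[pos]
  out[pos] <- sigma2 * u^nu * besselK(u, nu) / (2^(nu - 1) * gamma(nu))
  out
}
# At nu = 1/2 this reduces to an exponential covariance; note that the
# 2 * sqrt(nu) factor in Eq. (4) rescales the range, so the plain exp(-rho * xi)
# form of Model 1 in Sect. 4 absorbs this constant into rho.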

3 State-space representation

Before turning to the state-space (SS throughout) representation associated with Eq. (1), we present a general version of the SS system for spatio-temporal processes, given by

$$\begin{aligned} Y_{t}(\mathbf{s})= & {} \left[ \begin{array}{cc} G_{t}(\mathbf{s})&{}\quad {{\varvec{M}}}_{t}(\mathbf{s}) \end{array}\right] \left[ \begin{array}{c} X_{t}(\mathbf{s}) \\ {\varvec{\beta }}_t(\mathbf{s}) \end{array}\right] + W_{t}(\mathbf{s}), \\ \left[ \begin{array}{c} X_{t+1}(\mathbf{s}) \\ {\varvec{\beta }}_{t+1}(\mathbf{s}) \end{array}\right]= & {} \left[ \begin{array}{cc} F_{t}(\mathbf{s}) &{}\quad 0 \\ 0 &{}\quad I_r \end{array}\right] \left[ \begin{array}{c} X_{t}(\mathbf{s}) \\ {\varvec{\beta }}_t(\mathbf{s}) \end{array}\right] + \left[ \begin{array}{c} H V_{t}(\mathbf{s}) \\ 0 \end{array}\right] , \end{aligned}$$
(5)

where \(Y_{t}(\cdot )\) is the observation at time t and location \(\mathbf{s}\in D\), \(G_{t}(\mathbf{s})\) is an observation operator, \({{\varvec{M}}}_{t}(\mathbf{s})\) is a vector of exogenous or predetermined variables, \([X_{t}(\mathbf{s}) \quad {\varvec{\beta }}_t(\mathbf{s})]^{\top }\) is the state vector, and \(W_{t}(\mathbf{s})\) is an observation noise with variance \({R_{W}}\). In addition, \(F_{t}(\cdot )\) is a state transition operator, \(I_r\) denotes the \(r\times r\) identity matrix hereafter, H is a linear operator, \(V_{t}(\mathbf{s})\) is a spatially colored, temporally white Gaussian process with mean zero and covariance function \({{\mathrm{cov}}}(V_{t}(\mathbf{s}), V_{t}(\mathbf{s}')) =C^{V}(\xi ; {\varvec{\theta }})\), and \(V_{t}(\mathbf{s})\) is uncorrelated with \(W_{t}(\mathbf{s})\), i.e., \({{\mathrm{\mathbb {E}}}}(W_{t}(\mathbf{s})V_{t'}(\mathbf{s}))=0\) for all \(\mathbf{s}\in D\) and for all \(t, t' \in \mathbb {Z_+}\).

The process (1) can be represented by an SS system as above by generalizing the infinite-dimensional equations given by Hannan and Deistler (1988) and Ferreira et al. (2013) to the spatio-temporal case. This can be achieved by assigning to the \((j + 1)\)th component of the state vector the lag-j value of the process \(\{\eta _{t}(\mathbf{s})\}\) driving the model error \(\{\varepsilon _{t}(\mathbf{s})\}\), i.e., \(X_{t}^{(j+1)} (\mathbf{s})=\eta _{t-j}(\mathbf{s})\) for \(j=0,1,\ldots \). In such a case, the process specified by (1) can be represented by the following infinite-dimensional state-space system

$$\begin{aligned} Y_{t}(\mathbf{s})= & {} {{\varvec{M}}}_{t}(\mathbf{s}){\varvec{\beta }}+ G X_{t}(\mathbf{s}), \\ X_{t+1}(\mathbf{s})= & {} F X_{t}(\mathbf{s}) + H \eta _{t+1}(\mathbf{s}), \end{aligned}$$
(6)

for \(t=1,\ldots ,T\), where \(G= \left[ \begin{array}{cccc} 1 &{}\quad \psi _1 &{}\quad \psi _2&{}\quad \ldots \end{array}\right] \), \(F=\left[ \begin{array}{cc} \mathbf{0}^{\top } &{}\quad 0 \\ I_{\infty } &{}\quad \mathbf{0} \end{array}\right] \), \(I_{\infty }=\text {diag}\{1,1,\ldots \}\), and \({R_{W}}=0\), i.e., we assume that our observed data are measured without additive error. From these equations, observe that \(H=[1,0, 0, \ldots ]^{\top }\) and \(V_{t}(\mathbf{s})=\eta _{t+1}(\mathbf{s})\). Estimation techniques based on (6) carry a notorious computational burden; for this reason, we truncate the expansion in (2) after some \(m \in \mathbb {Z_{+}}\) components, so that an approximation for \(\{Y_{t}(\mathbf{s})\}\) can be written as

$$\begin{aligned} Y_{t}(\mathbf{s}) = {{\varvec{M}}}_{t}(\mathbf{s}){\varvec{\beta }}+ \sum _{j=0}^{m} \psi _j \eta _{t-j}(\mathbf{s}), \quad \mathbf{s}\in D, \end{aligned}$$
(7)

for some positive integer m. Thus, the finite-dimensional SS representation of model (7) is considered, with observation and state equations given by

$$\begin{aligned} Y_{t}(\mathbf{s})= & {} {{\varvec{M}}}_{t}(\mathbf{s}){\varvec{\beta }}+ G X_{t}(\mathbf{s}), \\ X_{t+1}(\mathbf{s})= & {} F X_{t}(\mathbf{s}) + H \eta _{t+1}(\mathbf{s}), \end{aligned}$$
(8)

where now \(G= \left[ \begin{array}{cccc} 1 &{}\quad \psi _1 &{}\quad \ldots &{}\quad \psi _m \end{array}\right] \), \(X_{t}(\mathbf{s})=[\eta _{t}(\mathbf{s}), \eta _{t-1}(\mathbf{s}), \ldots , \eta _{t-m}(\mathbf{s})]^{\top }\), \(F=\left[ \begin{array}{cc} \mathbf{0}^{\top } &{}\quad 0 \\ I_{m} &{}\quad \mathbf{0} \end{array}\right] \) and \(H=[1,0,\ldots ,0]^{\top }\). It is worth noting that the matrices involved in the truncated Kalman equations have the following dimensions. For \(\mathbb {M}_{\mathbb {R}(p \times q)}\) the space of real-valued matrices of dimension \(p \times q\), we have \(G \in \mathbb {M}_{\mathbb {R}(1 \times (m+1))}\), \(X_{t}(\mathbf{s})\in \mathbb {M}_{\mathbb {R}((m+1)\times 1)}\), \(F \in \mathbb {M}_{\mathbb {R}((m+1)\times (m+1))}\) and \(H \in \mathbb {M}_{\mathbb {R}((m+1)\times 1)}\). The following result establishes the asymptotic magnitude of the truncation error when approximating (1) by (7).
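The truncated system matrices are simple to assemble in R; a minimal sketch (function and variable names are ours):

# State-space matrices of Eq. (8) from the weights psi = (psi_0, ..., psi_m),
# with psi_0 = 1; F is the (m+1) x (m+1) shift-down matrix.
build_ss <- function(psi) {
  m <- length(psi) - 1
  list(G = matrix(psi, nrow = 1),              # 1 x (m+1)
       F = rbind(0, cbind(diag(m), 0)),        # ones on the subdiagonal
       H = matrix(c(1, numeric(m)), ncol = 1)) # (m+1) x 1
}
# Example: AR(1) errors with phi = 0.65 truncated at m = 5, so psi_j = phi^j.
ss <- build_ss(0.65^(0:5))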

Proposition 1

Let \(m\in \mathbb {Z_{+}}\), \(t\in \mathbb {Z}\) and \(\underline{\mathbf{s}}=(\mathbf{s}_1, \mathbf{s}_2, \ldots , \mathbf{s}_{n})^{\top }\) be given. Let \({\varvec{\eta }}_{t}(\underline{\mathbf{s}}) = (\eta _{t}(s_1), \ldots , \eta _{t}(s_n))^{\top }\) and let \(r_{m}(\underline{\mathbf{s}}) \in \mathbb {M}_{\mathbb {R}(n \times n)}\) be the truncation error variance matrix, defined as \(r_{m}(\underline{\mathbf{s}})={{\mathrm{Var}}}(\sum _{j=0}^\infty \psi _j{\varvec{\eta }}_{t-j}(\underline{\mathbf{s}})- \sum _{j=0}^{m} \psi _j {\varvec{\eta }}_{t-j}(\underline{\mathbf{s}}))\). Then, for large m and n, with \(a>0\) and \(0<d<1/2\), we have the following:

$$\begin{aligned} ||r_m(\underline{\mathbf{s}})|| \sim \left\{ \begin{array}{ll} \mathcal {O}( n m^{2d-1}), &{}\quad \mathrm{for \,a\, long-memory\, process} \\ \mathcal {O}( n \exp {(-am)}), &{}\quad \mathrm{for \,a\, short-memory\, process}, \\ \end{array} \right. \end{aligned}$$

where \(||A||= \max _{1\le i \le n}\sum _{j=1}^{n} \mid a_{ij} \mid \) is the infinity norm of A.

Proof

For an infinite moving average process \(\{Y_{t}(\underline{\mathbf{s}})\}\), the error variance matrix for the truncated series is

$$\begin{aligned} r_m(\underline{\mathbf{s}})= & {} {{\mathrm{Var}}}\left( \sum _{j=0}^{\infty } \psi _j{\varvec{\eta }}_{t-j}(\underline{\mathbf{s}}) - \sum _{j=0}^{m} \psi _j{\varvec{\eta }}_{t-j} (\underline{\mathbf{s}}) \right) ={{\mathrm{Var}}}\left( \sum _{k=m+1}^{\infty }\psi _k {\varvec{\eta }}_{t-k}(\underline{\mathbf{s}})\right) \\= & {} \left[ C^{\eta }(||\mathbf{s}_i-\mathbf{s}_j||)\right] _{i,j=1}^{n}\sum _{k=m+1}^{\infty }\psi _k^{2} = \mathbf{{c}}_n b_{m}, \end{aligned}$$

where the coefficients satisfy \(\psi _{m}\sim \frac{\varTheta (1) }{\varPhi (1)} \frac{m^{ d-1}}{\varGamma (d)}\) as \(m \rightarrow \infty \), where d is the long-memory parameter; see Corollary 3.1 of Kokoszka and Taqqu (1995). In particular, if \(\{Y_{t}(\mathbf{s})\}\) is a FN(d) process with \(\psi _{m}=\frac{\varGamma (m+d)}{\varGamma (m+1)\varGamma (d)}\), then applying Stirling's approximation we have \(\psi _{m} \sim \frac{m^{d-1}}{\varGamma (d)}\), \(m\rightarrow \infty \). Furthermore, using Lemma 3.3 of Palma (2007) we get

$$\begin{aligned} \sum _{k=m+1}^{\infty }\psi _k^{2} \sim \frac{m^{2d-1}}{\varGamma (d)^{2} (1-2d)}, \quad 0<d < 1/2, \end{aligned}$$
(9)

so that (9) implies \(b_m= \mathcal {O}(m^{2d-1})\) as \(m \rightarrow \infty \). On the other hand, \(||\mathbf{{c}}_n||_{\infty } = \max _{1\le i \le n}\sum _{j=1}^{n} \mid C^{\eta }(||\mathbf{s}_i-\mathbf{s}_j||)\mid \le n\sigma ^2 \), where \(\sigma ^2=C^{\eta }(0; {\varvec{\theta }})\). Thus, \(||\mathbf{{c}}_n|| = \mathcal {O}(n)\) for large n, proving the result. The proof for the case of a short-memory process is analogous, using \(\psi _{m}\sim \exp {(-am)}\) as \( m \rightarrow \infty \).
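The long-memory rate in Proposition 1 is easy to check numerically from the FN(d) weights; a quick sketch:

# Tail sums b_m = sum_{k > m} psi_k^2 for FN(d) weights, compared against the
# m^(2d - 1) rate of Proposition 1; the ratios should stabilize as m grows.
d   <- 0.3
k   <- 0:50000
psi <- exp(lgamma(k + d) - lgamma(k + 1) - lgamma(d))
tail_sums <- rev(cumsum(rev(psi^2)))   # tail_sums[i] = sum over k >= i - 1
m   <- c(10, 100, 1000)
tail_sums[m + 2] / m^(2 * d - 1)       # roughly constant for large m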

Some comments are in order. First, space–time asymptotics is still an open problem in the geostatistical literature. Most contributions are devoted to one of two approaches. On the one hand, one may fix the number of spatial sites and let the number of temporal observations tend to infinity; this is the approach followed, for instance, by Li et al. (2008). Under this framework, evaluating the performance of our estimators amounts to applying, mutatis mutandis, the results in Chan and Palma (1998). On the other hand, one may adopt increasing domain asymptotics (Guyon 1995), in which case space–time asymptotics is covered by the results in Guyon (1995). A very tempting approach would be to consider infill asymptotics in space and increasing domain asymptotics in time; to the best of our knowledge, this approach is not available in the literature and represents a major challenge.

3.1 Derivation of the Kalman filter algorithm

We now develop the Kalman filter algorithm for the spatio-temporal process defined in Eq. (7) with the associated SS representation given in (8). The Kalman filter is a powerful tool for making inferences about the state vector, allowing one to calculate the conditional mean and covariance matrix of the state vector \(\left[ X_{t}(\mathbf{s}), {\varvec{\beta }}(\mathbf{s})\right] ^{\top }\). For the sake of simplicity, we restrict our attention to studying the behavior of the parameter estimates of the error process \(\{\varepsilon _{t}(\cdot )\}\) in the regression model (1), i.e.,

$$\begin{aligned} Y_{t}(\mathbf{s}) = \sum _{j=0}^{m}\psi _j\eta _{t-j}(\mathbf{s}), \quad \mathbf{s}\in D, \end{aligned}$$
(10)

in this case, the state vector reduces to \(X_{t}(\mathbf{s})\). The Kalman filter recursive equations are well known, but we present them here to set up notation and keep the exposition self-contained. First, define the \(n\times 1\) vector \(\mathbf{{Y}}_t=[Y_t(\mathbf{s}_1), \ldots , Y_{t}(\mathbf{s}_n)]^{\top }\) containing the data values at the n spatial locations \(\{\mathbf{s}_i\}_{i=1}^{n}\) at time t, and let \(\mathbf{{X}}_{t}=[X_t(\mathbf{s}_1), \ldots , X_{t}(\mathbf{s}_n)]^{\top }\) be the corresponding \(n \times 1\) vector of the unobservable spatio-temporal state process at the n locations, where each component of this state vector is an \((m+1)\)-dimensional vector. Let \(\widehat{X}_{t}(\mathbf{s})= {{\mathrm{\mathbb {E}}}}(X_{t}(\mathbf{s})| \mathbf{{Y}}_1,\ldots , \mathbf{{Y}}_{t-1} )\) be the best linear unbiased predictor (BLUP) of the unobserved state \(X_{t}(\mathbf{s})\) and let \(\varOmega _{t}(\mathbf{s},\mathbf{s}') = {{\mathrm{cov}}}(X_{t}(\mathbf{s})-\widehat{X}_{t}(\mathbf{s}), X_{t}(\mathbf{s}')-\widehat{X}_{t}(\mathbf{s}') )\) be the state prediction error variance–covariance matrix. Finally, the initial state vector has mean \(\widehat{X}_1(\mathbf{s})={{\mathrm{\mathbb {E}}}}([\eta _0(\mathbf{s}), \eta _{-1}(\mathbf{s}), \ldots , \eta _{1-m}(\mathbf{s})]^{\top })=\mathbf{{0}}_{((m+1) \times 1)}\) and covariance matrix

$$\begin{aligned} \varOmega _{1}(\mathbf{s},\mathbf{s}')= & {} {{\mathrm{cov}}}([\eta _0(\mathbf{s}), \eta _{-1}(\mathbf{s}), \ldots , \eta _{1-m}(\mathbf{s})]^{\top }, [\eta _0(\mathbf{s}'), \eta _{-1}(\mathbf{s}'), \ldots , \eta _{1-m}(\mathbf{s}')]^{\top }) \\= & {} \text {diag}({{\mathrm{cov}}}(\eta _0(\mathbf{s}), \eta _0(\mathbf{s}')), \ldots , {{\mathrm{cov}}}(\eta _{1-m}(\mathbf{s}), \eta _{1-m}(\mathbf{s}'))), \end{aligned}$$

which is an \((m+1) \times ( m+1)\) matrix. The Kalman filter allows us to estimate the state vector \({X}_{t+1}(\mathbf{s})\), for \(\mathbf{s}\in D\), and its prediction error based on the information available at time t. These estimates are given by

$$\begin{aligned} {\widehat{X}}_{t+1}(\mathbf{s})&= F{\widehat{X}}_{t}(\mathbf{s}) + \varTheta ^{\top }_{t}(\mathbf{s}) {\varvec{\Delta }}_{t}^{-1} (\mathbf{Y}_{t} - \widehat{\mathbf{Y}}_{t}), \end{aligned}$$
(11a)
$$\begin{aligned} \varOmega _{t+1}(\mathbf{s}, \mathbf{s}')&= F\varOmega _{t}(\mathbf{s},\mathbf{s}') F^{\top } + C^{HV}(\mathbf{s},\mathbf{s}') - \varTheta ^{\top }_{t}(\mathbf{s}) {\varvec{\Delta }}_{t}^{-1}\varTheta _{t}(\mathbf{s}'), \end{aligned}$$
(11b)

where

$$\begin{aligned} {\varvec{\Delta }}_{t}= & {} {{\mathrm{{Var}}}}(\mathbf {Y}_{t}-\widehat{\mathbf{Y}}_{t}),\\= & {} {{\mathrm{{Var}}}}( G ({\mathbf{X}}_t - \widehat{\mathbf{X}}_{t})),\\= & {} \left[ \begin{array}{cccccc} G \varOmega _{t}(\mathbf{s}_1,\mathbf{s}_1) G^{\top } &{}\quad \cdots &{}\quad G \varOmega _{t}(\mathbf{s}_1,\mathbf{s}_n) G^{\top } \\ \vdots &{}\quad \ddots &{}\quad \vdots \\ G \varOmega _{t}(\mathbf{s}_n,\mathbf{s}_1) G^{\top } &{}\quad \cdots &{}\quad G \varOmega _{t}(\mathbf{s}_n,\mathbf{s}_n) G^{\top }\\ \end{array} \right] ,\\ \varTheta ^{\top }_{t}(\mathbf{s})= & {} {{\mathrm{cov}}}(X_{t+1}(\mathbf{s}), \mathbf{{Y}}_{t}-\widehat{\mathbf{Y}}_{t} ), \\= & {} \left[ \begin{array}{cc} {{\mathrm{cov}}}(X_{t+1}(\mathbf{s}), Y_{t}(\mathbf{s}_1)-\widehat{{Y}}_{t}(\mathbf{s}_1) ) \\ \vdots \\ {{\mathrm{cov}}}(X_{t+1}(\mathbf{s}), Y_{t}(\mathbf{s}_n)-\widehat{{Y}}_{t}(\mathbf{s}_n) ) \end{array} \right] ^{\top } = \left[ \begin{array}{cc} F \varOmega _{t}(\mathbf{s},\mathbf{s}_1) G^{\top } \\ \vdots \\ F \varOmega _{t}(\mathbf{s},\mathbf{s}_n) G^{\top } \end{array} \right] ^{\top },\\ \widehat{Y}_{t}(\mathbf{s})= & {} {{\mathrm{\mathbb {E}}}}(Y_{t}(\mathbf{s})| \mathbf {Y}_1,\ldots , \mathbf {Y}_{t-1} )= G \widehat{X}_{t}(\mathbf{s}),\\ C^{HV}(\mathbf{s},\mathbf{s}')= & {} {{\mathrm{cov}}}(H V_{t}(\mathbf{s}), H V_{t}(\mathbf{s}') )= H C^{V}(\xi )H^{\top }= \left( \begin{array}{cccc} C^{\eta }(\xi ) &{}\quad \mathbf{0} \\ \mathbf {0} &{}\quad \mathbf{0} \\ \end{array} \right) , \end{aligned}$$

where \(\varTheta ^{\top }_{t}(\mathbf{s})\in \mathbb {M}_{\mathbb {R}((m+1) \times n)}\) and \(C^{HV}(\mathbf{s},\mathbf{s}') \in \mathbb {M}_{\mathbb {R}((m+1)\times ( m+1))}\). Let \({\varvec{\theta }}\) be the parameter vector specifying model (10); then the log-likelihood function \(\mathcal {L}(\cdot )\) (up to a constant) can be obtained from (11) as

$$\begin{aligned} \mathcal {L}({\varvec{\theta }}) = -\frac{1}{2}\sum _{t=1}^{T}\left[ \log \left| {\varvec{\Delta }}_{t}({\varvec{\theta }}) \right| + {{\varvec{\epsilon }}}_{t}({\varvec{\theta }}) ^{\top }{{\varvec{\Delta }}_{t}({\varvec{\theta }})}^{-1}{{\varvec{\epsilon }}}_{t}({\varvec{\theta }})\right] , \end{aligned}$$

where \({\varvec{\epsilon }}_{t}({\varvec{\theta }}) = \left( \mathbf{{Y}}_{t}- {{\widehat{\mathbf{Y}}}}_{t}\right) \) is the innovation vector, and \({\varvec{\Delta }}_{t}({\varvec{\theta }})\) is the innovation covariance matrix at time t, both evaluated at the parameter value \({\varvec{\theta }}\). Hence, the approximate maximum likelihood estimate (MLE) provided by the Kalman equations (11) is given by \(\widehat{{{\varvec{\theta }}}} = \arg \max _{{{{\varvec{\theta }}}} \in {\varvec{\varTheta }}} \mathcal {L}({\varvec{\theta }})\), where \({\varvec{\varTheta }}\) is the parameter space. Note that the Kalman equations (11) can be applied directly to the general state-space representation (5), yielding in that case an exact MLE.
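To illustrate, for a single location (\(n=1\)) the recursions (11) and the log-likelihood above collapse to a few lines of R. The following is a sketch under that simplification (function and variable names are ours), with sigma2 \(= C^{\eta }(0; {\varvec{\theta }})\) the innovation variance at the site; the spatial version stacks the n sites into the block matrices displayed above.

# Truncated Kalman filter and Gaussian log-likelihood for one location, using
# the shift-form matrices of Eq. (8); psi must include psi_0 = 1.
kf_loglik <- function(y, psi, sigma2) {
  m1  <- length(psi)                         # m + 1
  G   <- matrix(psi, 1, m1)
  Fm  <- rbind(0, cbind(diag(m1 - 1), 0))    # shift matrix F
  CHV <- matrix(0, m1, m1); CHV[1, 1] <- sigma2
  x   <- matrix(0, m1, 1)                    # initial state mean
  P   <- diag(sigma2, m1)                    # initial Omega_1
  ll  <- 0
  for (t in seq_along(y)) {
    Delta <- drop(G %*% P %*% t(G))          # innovation variance Delta_t
    eps   <- y[t] - drop(G %*% x)            # innovation epsilon_t
    Theta <- Fm %*% P %*% t(G)               # Theta_t^T, an (m+1) x 1 matrix
    x <- Fm %*% x + Theta * (eps / Delta)    # Eq. (11a)
    P <- Fm %*% P %*% t(Fm) + CHV - tcrossprod(Theta) / Delta  # Eq. (11b)
    ll <- ll - 0.5 * (log(Delta) + eps^2 / Delta)
  }
  ll
}
# The resulting log-likelihood can be maximized numerically, e.g., with nlminb().

In order to obtain predictions of unobserved values at a location \(\mathbf{s}_0\), we define the best linear predictor as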

$$\begin{aligned} \widehat{Y}_{T+k}(\mathbf{s}_0)= {{\mathrm{\mathbb {E}}}}(Y_{T+k}(\mathbf{s}_0) | \mathbf{{Y}}_1,\ldots , \mathbf{{Y}}_T ), \quad \mathbf{s}_{0}\in D, \end{aligned}$$
(12)

which is the k-step out-of-sample predictor based on the finite past, for \(k=1, \ldots , K\). These forecasts and their mean squared prediction errors are obtained from the Kalman recursive equations given by (11), as follows

$$\begin{aligned} \widehat{Y}_{T+k}(\mathbf{s}_0)&={{\mathrm{\mathbb {E}}}}(Y_{T+k}(\mathbf{s}_0)| \mathbf{{Y}}_1,\ldots , \mathbf{{Y}}_T)= G {{\mathrm{\mathbb {E}}}}(X_{T+k}(\mathbf{s}_0)| \mathbf{{Y}}_1,\ldots , \mathbf{{Y}}_T) \nonumber \\&= GF{{\mathrm{\mathbb {E}}}}(X_{T+k-1}(\mathbf{s}_0)| \mathbf{{Y}}_1,\ldots , \mathbf{{Y}}_T)\nonumber \\&\vdots \nonumber \\&= GF^{k-1}{{\mathrm{\mathbb {E}}}}(X_{T+1}(\mathbf{s}_0)| \mathbf{{Y}}_1,\ldots , \mathbf{{Y}}_T)= G F^{k}\mathrm {\widehat{X}}_{T} (\mathbf{s}_0). \end{aligned}$$
(13)

Additionally, its mean squared prediction error is calculated using the following recursive relations

$$\begin{aligned} \varOmega _{T+1}(\mathbf{s}_0, \mathbf{s}_0)&= F\varOmega _{T}(\mathbf{s}_0, \mathbf{s}_0)F^{\top } + C^{HV} (\mathbf{s}_0,\mathbf{s}_0) \\ \varOmega _{T+2}(\mathbf{s}_0, \mathbf{s}_0)&= F^{2}\varOmega _{T}(\mathbf{s}_0, \mathbf{s}_0) (F^{\top })^2 + F C^{HV}(\mathbf{s}_0,\mathbf{s}_0) (F)^{\top } + C^{HV}(\mathbf{s}_0,\mathbf{s}_0)\\&\vdots \\ \varOmega _{T+k}(\mathbf{s}_0, \mathbf{s}_0)&= F^{k}\varOmega _{T}(\mathbf{s}_0, \mathbf{s}_0)(F^{\top })^k + \sum _{j=0}^{k-1} F^{j} C^{HV}(\mathbf{s}_0,\mathbf{s}_0) (F^{\top })^{j}. \end{aligned}$$

Furthermore, the prediction error variance \({\varvec{\Delta }}_{T+k}(\mathbf{s}_0,\mathbf{s}_0)\) satisfies

$$\begin{aligned} {\varvec{\Delta }}_{T+k}(\mathbf{s}_0,\mathbf{s}_0)&= {{\mathrm{{Var}}}}(Y_{T+k}(\mathbf{s}_0)-\widehat{Y}_{T+k}(\mathbf{s}_0)| \mathbf{{Y}}_1,\ldots , \mathbf{{Y}}_T) \nonumber \\&= {{\mathrm{{Var}}}}(G X_{T+k}(\mathbf{s}_0) + W_{T+k}(\mathbf{s}_0)- G \widehat{X}_{T+k} | \mathbf{{Y}}_1,\ldots , \mathbf{{Y}}_T )\nonumber \\&= G \varOmega _{T+k}(\mathbf{s}_0, \mathbf{s}_0)G^{\top }\nonumber \\&= GF^{k}\varOmega _{T}(\mathbf{s}_0, \mathbf{s}_0) (F^{\top })^kG^{\top } + G\sum _{j=0}^{k-1} F^{j} C^{HV}(\mathbf{s}_0,\mathbf{s}_0) (F^{\top })^{j}G^{\top }. \end{aligned}$$
(14)
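The forecast recursions (13) and (14) reuse the same matrices; a short sketch continuing the single-location example above, with x and P the filtered quantities \(\widehat{X}_{T}\) and \(\varOmega _{T}\), and ss the output of the hypothetical build_ss of Sect. 3:

# k-step forecasts (13) and prediction variances (14); CHV as in kf_loglik.
forecast_k <- function(ss, x, P, CHV, k) {
  yhat <- pvar <- numeric(k)
  for (j in seq_len(k)) {
    x <- ss$F %*% x                          # E(X_{T+j} | Y_1, ..., Y_T)
    P <- ss$F %*% P %*% t(ss$F) + CHV        # Omega_{T+j}
    yhat[j] <- drop(ss$G %*% x)              # G F^j X-hat_T, Eq. (13)
    pvar[j] <- drop(ss$G %*% P %*% t(ss$G))  # Delta_{T+j}, Eq. (14)
  }
  list(forecast = yhat, variance = pvar)
}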

3.2 Missing observations

The analysis of missing observations in time series is an issue that has been treated by several authors; see Harvey (1989) and Durbin and Koopman (2012), among others. The SS method and its associated Kalman filter algorithm provide a simple methodology for handling missing values.

In order to describe this procedure, assume that the observations \(Y_{t}(\mathbf{s})\) are missing for \(t=n+1, \ldots , T-n\). Then the vector \({\varvec{\epsilon }}_{t}(\mathbf{s})\) and the matrix \(\varTheta _{t}(\mathbf{s})\) of the Kalman filter are set to zero, and the Kalman updates become

$$\begin{aligned} \mathrm {\widehat{X}}_{t+1}(\mathbf{s})= & {} F\mathrm {\widehat{X}}_{t}(\mathbf{s}) ,\\ \varOmega _{t+1}(\mathbf{s},\mathbf{s}')= & {} F\varOmega _{t}(\mathbf{s},\mathbf{s}')F^{\top } + C^{HV}(\mathbf{s},\mathbf{s}'), \quad \mathbf{s}, \mathbf{s}' \in D, \end{aligned}$$

for \(t=n+1, \ldots , T-n\). This imputation procedure also provides an alternative way of obtaining the forecasts of \(Y_{T+k}(\mathbf{s})\) together with their forecast errors: one merely treats \(Y_{t}(\mathbf{s})\) for \(t=T+1, \ldots , T+k\) as missing observations and continues the Kalman filter beyond \(t=T\) with \({\varvec{\epsilon }}_{t}(\mathbf{s})=0\) and \(\varTheta _{t}(\mathbf{s})=0\) for \(t>T\). This forecasting procedure is an elegant feature of SS methods for time series analysis.
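In code, this amounts to skipping the measurement update; a sketch of the change inside the loop of the kf_loglik sketch of Sect. 3.1:

# Treat NA observations as missing: zero innovation and gain, so that only the
# time update is performed.
if (is.na(y[t])) {
  x <- Fm %*% x                              # X-hat_{t+1} = F X-hat_t
  P <- Fm %*% P %*% t(Fm) + CHV              # Omega_{t+1} = F Omega_t F' + C^HV
  next
}
# Appending k NA values to y then yields the k-step forecasts as a by-product,
# as described above.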

4 Simulation studies

For the simulation studies, we used the free software R (R Core Team 2017) and C subroutines connected to R (Peng and de Leeuw 2002) through the .C interface. The numerical optimization of the Gaussian log-likelihood function to obtain the QML estimates was carried out using the nlminb command of R. This method makes use of the subroutine “BFGS”, corresponding to a quasi-Newton method (Broyden 1969; Fletcher 1970; Goldfarb 1970). We used nlminb because this optimizer can be applied even when the sample size is small, achieving convergence in the optimization process; in addition, it is less sensitive to initial values than other optimizers. We use Monte Carlo experiments to analyze the finite sample behavior of the Kalman filter estimator, for both short- and long-memory spatio-temporal processes, as detailed in Sect. 2. In particular, we consider two models: the first is an ARMA(1, 1), a short-memory case, and the second is an ARFIMA, a long-memory case.

4.1 Short-memory case

Consider an ARMA(1, 1) model for the errors defined by (1) with

$$\begin{aligned} \varepsilon _{t}(\mathbf{s})= \phi \varepsilon _{t-1}(\mathbf{s}) + \theta \eta _{t-1}(\mathbf{s}) + \eta _{t}(\mathbf{s}), \quad \psi _j=\left( \phi +\theta \right) \phi ^{j-1} \quad \text{ for } \quad j\ge 1, \end{aligned}$$
(15)

where the \(\{\eta _{t}(\cdot )\}\) are independent over time and follow a stationary, zero-mean Gaussian spatial random process with covariance function given by Eq. (3). We consider the general class of Matérn covariance models defined by Eq. (4). In particular, we use two types of covariance functions, generating two possible models:

  • Model 1 \(\nu =\frac{1}{2}\), corresponding to the exponential model

    $$\begin{aligned} C^{\eta }(\xi ; {\varvec{\theta }})=\sigma ^2 \exp \{-\rho \xi \}, \quad \xi \ge 0, \quad {\varvec{\theta }}=(\sigma ^2, \rho , 1/2)^{\top }, \quad \text{ and } \end{aligned}$$
  • Model 2 \(\nu =\frac{3}{2}\), which leads to

    $$\begin{aligned} C^{\eta }(\xi ; {\varvec{\theta }})=\sigma ^2\left( 1+ \rho \xi \right) \exp \{-\rho \xi \} , \quad \xi \ge 0, \quad {\varvec{\theta }}=(\sigma ^2, \rho , 3/2)^{\top }. \end{aligned}$$

The choice of \(\nu \) in \(C^{\eta }\) affects the mean square (m.s.) differentiability of the associated random field. For Model 1, the associated Gaussian random field is a.s. continuous but not m.s. differentiable. For Model 2, it is m.s. differentiable. We assume that we observe a spatio-temporal process \(\{\varepsilon _{t}({\mathbf{s}}_{i}): i=1, \ldots ,n; t=1, \ldots , T\}\) on a regular, rectangular grid of \(n\times n=N\) spatial locations in \([1,n]^2\), and at equidistant time points. For the data generation scheme, the process is generated recursively from (15) with initial values \(\eta _{1}(\mathbf{s}) \sim N(0, C^{\eta }(0))\) and \(\varepsilon _{1}(\mathbf{s})= C^{\eta }(0)/(1-\phi ^2) + \eta _{1}(\mathbf{s})\). Each realization is of length \(T=100, 250\). In addition, the parameters of the covariance function are held constant, with different values for \(\sigma ^2\) and \(\rho \). Finally, we simulate each process 100 times, and for each simulation the Kalman filter estimates are evaluated by the relative bias (RelBias) and the mean square error (MSE), defined as \(\text{ RelBias }({\varvec{\theta }})=\frac{1}{100}\sum _{i=1}^{100}\left( \widehat{\theta }_{i}/\theta _{i} - 1\right) \quad \text {and} \quad \text{ MSE }({\varvec{\theta }})=\frac{1}{100}\sum _{i=1}^{100}\left( \widehat{\theta }_{i}-\theta _{i}\right) ^{2}\), where \(\widehat{\theta }_{i}\) is the Kalman filter estimate of \(\theta _i\) for the ith realization. A preliminary assessment of the SS approach concerns the choice of the truncation level, which has an influence on the parameter estimates. Figure 1 plots the estimated MSE for an AR(1) model for different values of \(\phi \) and \(\rho \) as a function of the truncation level m of the MA\((\infty )\) decomposition. Figure 2 displays the MSE for different values of d and \(\rho \) as a function of the truncation level m, based on a FN(d) model. In both cases, we use Model 1 as the spatial covariance function with \(\sigma ^2=1\), \(T=250\) and \(N=100\) spatial locations. In these graphs, darker regions represent smaller empirical MSE, while lighter regions indicate greater MSE values. Note that, for the long-memory case, an improvement in terms of MSE is evident when the truncation parameter is \(m \ge 10\). Similar results were obtained in Chan and Palma (1998) with \(m> 6\) when considering only the temporal domain. The short-memory case requires a lower truncation level (\(m=5\)) in the state-space representation to guarantee the efficient performance of the truncated MLE. In light of this evidence, and due to space constraints, only a subset (\(m=5,10\)) of the results is presented; other results and codes are available from the authors upon request.
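For completeness, the two Monte Carlo summary criteria are immediate to compute; a sketch (names are ours):

# RelBias and MSE over the 100 replications for one parameter; est is the
# vector of Kalman filter estimates and truth the true parameter value.
rel_bias <- function(est, truth) mean(est / truth - 1)
mse      <- function(est, truth) mean((est - truth)^2)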

Fig. 1
figure 1

MSE as a function of m and \((\phi , \rho )\) for the MA approximation of an AR(1) spatio-temporal process with covariance function following an exponential model. In a the empirical MSE of \(\phi \). In b the empirical MSE of the covariance parameter \(\rho \)

Fig. 2
figure 2

MSE as a function of m and \((d, \rho )\) for the MA approximation of a FN(d) spatio-temporal process with covariance function following an exponential model. In a the empirical MSE of the long-memory parameter d. In b the empirical MSE of the covariance parameter \(\rho \)

Table 1 shows the estimates of the parameters for two truncation levels, \(m = 5, 10\), and \(N=100\) spatial locations. We have used combinations of parameter values that reproduce situations widely encountered in practical analysis. These scenarios are shown in Table 1. Note that the estimates are very close to their theoretical counterparts. Furthermore, it is noteworthy that goodness-of-fit criteria such as the standard deviation (SD), bias and \(\sqrt{{\hbox {MSE}}}\) are very similar across truncation levels. In effect, even with truncation parameter \(m=5\), the truncated Kalman filter works extremely well for both sample sizes.

4.2 Long-memory case

Consider now the following stationary ARFIMA(0, d, 1) model defined by

$$\begin{aligned} \varepsilon _{t}(\mathbf{s})= (1-\theta B ) (1-B)^{-d}\eta _{t}(\mathbf{s}), \quad \psi _j= \frac{\varGamma (j+d)}{\varGamma (j+1)\varGamma (d)} - \theta \frac{\varGamma (j+d-1)}{\varGamma (j)\varGamma (d)}, \end{aligned}$$

for \(j\ge 1\), where \(\varGamma (\cdot )\) is the Gamma function, d is the long-memory coefficient such that \(0<d<1/2\), and \(\theta \) is a moving average coefficient satisfying \(|\theta | < 1\). The innovations \(\{\eta _{t}(\mathbf{s})\}\) were generated with the same spatial structure as in the short-memory case. The samples from this ARFIMA process are generated using the innovations algorithm; see Brockwell and Davis (1991), page 172. In this implementation, the temporal covariance of the process \(\{\varepsilon _{t}(\mathbf{s})\}\) is given by

$$\begin{aligned} \kappa _{{T}}(h)=\frac{\varGamma (1-2d)\varGamma (h+d)}{\varGamma (1-d) \varGamma (d)\varGamma (h+1-d)} \times \left[ 1 + \theta ^2 - \theta \frac{h-d}{h-1+d} - \theta \frac{h+d}{h+1-d} \right] , \end{aligned}$$

for \(h>0\). We consider practical combinations of parameter values commonly encountered in the analysis of data sets with long-range dependence. Table 2 reports the results from the Monte Carlo simulations for several parameter values and two truncation levels, \(m=5,10\). The simulations are based on sample sizes \(T=100, 250\) and 100 replications. As in the previous case, the observed means of the estimates are close to their expected values. In contrast to the findings for the short-memory case, a long-memory process requires a higher truncation level in the state-space representation to guarantee the efficient performance of the truncated MLE.

Table 1 Results for the Kalman filter estimates for an ARMA(1, 1) model with observed locations on the square \([0,10]^2\)
Table 2 Results for the Kalman filter estimates for an ARFIMA(0, d, 1) model with observed locations on the square \([0,10]^2\)

Another important aspect to assess is the out-of-sample predictive ability of our methodology. To this end, we simulated data sets under ARMA(1, 1) and ARFIMA(0, d, 1) models, in both cases using Model 1 as the spatial covariance structure. For each simulated data set, we fitted three models, giving rise to three cases of comparison, namely AR(1), ARMA(1, 1) and ARFIMA(0, d, 1), and compared them using cross-validation. Cross-validation is implemented by deleting K time points for each location \(\mathbf{s}\) from the observed data, i.e., retaining \(\{Y_{t}(\mathbf{s}_i): t=1,\ldots , T-K ; i=1,\ldots ,n \}\), and then predicting \(\{\widehat{Y}_{t_k}(\mathbf{s}_i)\}_{k=1}^{K}\) from the remaining data. We used the cross-validation statistics suggested by Carroll and Cressie (1997), given by

$$\begin{aligned} CR_1(\mathbf{s}_i)= & {} \frac{\sum _{k=1}^{K} \left[ Y_{t_k}(\mathbf{s}_i){-} \widehat{Y}_{t_k}(\mathbf{s}_i) \right] }{\left[ \sum _{k=1}^{K} {\varvec{\Delta }}_{t_k}(\mathbf{s}_i) \right] ^{1/2} }, {\quad } CR_2(\mathbf{s}_i)= \left\{ \frac{ \sum _{k=1}^{K} \left[ Y_{t_k}(\mathbf{s}_i){-} \widehat{Y}_{t_k}(\mathbf{s}_i) \right] ^2 }{ \sum _{k=1}^{K} {\varvec{\Delta }}_{t_k}(\mathbf{s}_i)}\right\} ^{1/2},\\ CR_3(\mathbf{s}_i)= & {} \left\{ \sum _{k=1}^{K} \frac{\left[ Y_{t_k}(\mathbf{s}_i)- \widehat{Y}_{t_k}(\mathbf{s}_i) \right] ^2}{T} \right\} ^{1/2},\\ \end{aligned}$$

where \(\widehat{Y}_{t_k}(\mathbf{s}_i)\) is the prediction of the process at location \(\mathbf{s}_i\) and time \(t_{k} ~ (k=1,\ldots , K)\), and \({\varvec{\Delta }}_{t_k}(\mathbf{s}_i)\) is the corresponding prediction variance; these values are obtained from (13) and (14), respectively. CR\(_1\) indicates the unbiasedness of the predictor and should be approximately equal to zero. CR\(_2\) checks the accuracy of the standard deviation of the prediction error and should be approximately equal to one. CR\(_3\) is a measure of goodness of prediction; one would like CR\(_3\) to be small, indicating that the predicted values are close to the true values. For the data generation scheme, the ARMA(1, 1) is generated with parameters \((\phi , \theta , \sigma ^2, \rho )=(0.45, 0.3,0.5,0.35)\), whereas for the ARFIMA(0, d, 1) we used \((d, \theta , \sigma ^2, \rho )=(0.15, 0.45, 1.5, 0.55)\), with sample size \(T=250\), \(N=25\) locations and truncation level \(m=10\). Table 3 displays the cross-validation statistics, averaged over the locations, based on predictions for 5 days, i.e., \(K=5\). It can be noticed that all these measures favor the true data-generating model.
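These statistics are straightforward to compute per location; a sketch (names are ours):

# Carroll-Cressie cross-validation statistics at one location: y and yhat are
# the K held-out values and their predictions (13), dvar the prediction
# variances (14), and Tlen the series length T appearing in CR3.
cr_stats <- function(y, yhat, dvar, Tlen) {
  e <- y - yhat
  c(CR1 = sum(e) / sqrt(sum(dvar)),
    CR2 = sqrt(sum(e^2) / sum(dvar)),
    CR3 = sqrt(sum(e^2) / Tlen))
}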

Table 3 Cross-validation statistics

4.3 Estimation with a small number of observations and missing data

This section assesses the performance of the Kalman filter estimator when the time series is short and has missing observations. For this case, we use an AR(1) model for the temporal structure and, for the spatial dependence, a third model defined as

  • Model 3: \(\nu =\infty \), corresponding to the Gaussian model

    $$\begin{aligned} C^{\eta }(\xi ; {\varvec{\theta }})=\sigma ^2 \exp \{-\rho ^2 \xi ^2\}, \quad \xi \ge 0, \quad {\varvec{\theta }}=(\sigma ^2, \rho , \infty )^{\top }. \end{aligned}$$

We consider a regular grid on the square \([0,10]^2\) and truncation level \(m= 5\). In addition, we incorporate \(10\%\) and \(20\%\) of missing values, randomly selected in each simulation from 10 locations with Cartesian coordinates (i, 8) for \(i=1,\ldots ,10\). For the AR(1) case, we consider \(\phi =0.65\), variance scale \(\sigma ^2= 1\) and spatial correlation \(\rho =0.5\). Finally, we work with sample sizes \(T=15, 50\) and 100 replications.

Table 4 Results for the Kalman filter estimates for an AR(1) with observed locations on the square \([0,10]^2\)

Table 4 shows the estimates of the parameters and the mean square errors. Note that these estimates are very close to their theoretical counterparts. As expected, the precision of the estimates worsens as the percentage of missing data increases.

5 Real data applications

5.1 TOMS data

The Kalman filter algorithm presented in Sect. 3.1 is now applied to Level 3 Total Ozone Mapping Spectrometer (TOMS) data. TOMS Level-3 data have been analyzed in several recent papers, including Jun and Stein (2008) and Porcu et al. (2015); we refer to these papers for a detailed description of the data. It is worth pointing out that TOMS data are located on a spatially regular grid of \(1^{\circ }\) latitude by \(1.25^{\circ }\) longitude away from the poles, i.e., over the latitude interval \([-89.5,89.5]\) and the longitude interval \([-180,180]\). We focus our analysis on 140 selected spatial points with complete temporal observations, for a total of 2100 observations in space and time. In addition, we converted coordinates from longitude/latitude to universal transverse mercator (UTM) coordinates. The projections were obtained using spTransform from the rgdal package (Bivand et al. 2015), which uses the PROJ.4 projection library to perform the calculations. The temporal covariance was analyzed by considering three structures, namely AR(1), AR(2) and ARMA(1,1) models. For the spatial covariance, we considered Models 1, 2 and 3 as described in the previous section. Table 5 reports the parameter estimates using the Kalman filter with truncation level \(m=5\). From Table 5, we note that the differences in estimation performance are not readily apparent. On the other hand, the parameters \(\phi _2\) and \(\theta \) provide relatively little information compared to the other parameters, indicating that the temporal correlation structure is potentially of AR(1) type. We computed the cross-validation statistics for predicting the last day of measurement, i.e., \(K = 1\); the averages of these statistics are presented in Table 6. We note no significant difference among all these measures. Nevertheless, we can conclude that for Model 1 with an AR(1) temporal model, CR\(_3\) exhibits a smaller value than in the other cases. Figure 3 displays the performance of this model in terms of the marginal spatial semivariograms and the marginal sample autocorrelation function (ACF). As shown in this figure, the Kalman filter estimation offers a good fit for the modeling of TOMS data.

Table 5 Parameter estimates for the TOMS data
Table 6 Cross-validation statistics for the TOMS data
Fig. 3
figure 3

TOMS data: a marginal spatial semivariogram versus estimated spatial semivariogram b marginal sample ACF versus estimated temporal AR(1) model

5.2 Irish wind data

Wind energy has grown significantly in developed countries, and the supply of electricity depends on methods for predicting wind speed at given locations. The Irish wind speed data have been studied by several authors, in particular Haslett and Raftery (1989), Gneiting (2002), Stein (2005) and Bevilacqua et al. (2012). Following Haslett and Raftery (1989), we omitted the Rosslare station and then applied a square root transformation and deseasonalization to the data. The seasonal component was estimated by calculating the average of the square roots of the daily means over all years and stations for each day of the year, and regressing the result on a set of annual harmonics. We refer to their paper for a detailed description of these data. Following Haslett and Raftery (1989), we use a long-memory process to model the temporal dependence, and an exponential model for the spatial covariance, defined as

$$\begin{aligned} C^{\eta }(\xi ; {\varvec{\theta }})= \left\{ \begin{array}{ll}\sigma ^2 \exp \{-\rho \xi \} &{}\quad \text{ if } \quad \xi \ne 0\\ 1 &{}\quad \text{ if } \quad \xi = 0, \end{array} \right. \end{aligned}$$

where \({\varvec{\theta }}=(\sigma ^2, \rho )^{\top }\), with \(\sigma ^2 \in (0, 1]\) and \(\rho >0\). In this direction, we propose three different models for the temporal dependence, namely FN(d), ARFIMA(1, d, 0) and ARFIMA(2, d, 0). In order to evaluate a possible short-memory structure in the temporal covariance, ARMA(1, 1) and AR(1) models are also considered. In addition, we use all data except the last week, which is reserved as a validation set. For the Kalman filter estimation, we used a truncation level \(m=10\). The parameters estimated for these cases are as follows:

  • For the AR(1) model, \(\widehat{\phi }=0.8564\), \(\widehat{\sigma }^2=0.7654\) and \(\widehat{\rho }=0.00438\), whereas for the ARMA(1, 1), \(\widehat{\phi }=0.6523\), \(\widehat{\theta }=0.3812\), \(\widehat{\sigma }^2=0.5841\) and \(\widehat{\rho }= 0.00817\).

  • For the FN(d) model, the estimates are: \(\widehat{d}=0.3373\), \(\widehat{\sigma }^2= 0.98799\) and \(\widehat{\rho }=0.00137\).

  • For the ARFIMA(1, d, 0) model, we have that \(\widehat{d}= 0.3137\), \(\widehat{\phi }= 0.04386\), \(\widehat{\sigma }^2= 0.98878\) and \(\widehat{\rho }= 0.00147\).

  • For the ARFIMA(2, d, 0) model, \(\widehat{d}= 0.3251\), \(\widehat{\phi }_1= 0.0101 \), \(\widehat{\phi }_2= -0.0599\), \(\widehat{\sigma }^2= 0.99786\) and \(\widehat{\rho }= 0.00164\).

Table 7 Cross-validation statistics for the Irish wind data

To choose the best model, we considered the previously defined cross-validation statistics. In this case, we perform predictions for 7 days, i.e., \(K=7\). From Table 7, we can see that the ARFIMA(1, d, 0) model yields the smallest CR\(_3\) value among all cases. Based on these results, we focus on the goodness-of-fit analysis of the long-memory models. Figure 4 exhibits two panels exploring the correlations between the stations and the marginal sample ACFs. Note from panel (b) that all the marginal sample ACFs decay slowly, confirming a long-memory behavior. The dashed line represents the FN(d) model, the continuous line corresponds to the ARFIMA(1, d, 0) model, and the dotted line represents the ARFIMA(2, d, 0) case. The ARFIMA(1, d, 0) model seems to offer a better fit to the temporal sample ACF, whereas the behaviors of the spatial correlations are very similar.

Fig. 4
figure 4

Irish wind data: a distance correlation plot and b sample ACF. FN(d) model (dashed line), ARFIMA(1, d, 0) (continuous line), ARFIMA(2, d, 0) (dotted line)

6 Discussion

In this article, we have proposed a state-space methodology for modeling spatio-temporal processes. In particular, we have proposed to model the temporal dependence structure, whether of short or long memory, through the infinite moving average representation MA\((\infty )\). In this context, we have incorporated ARFIMA models to quantify the temporal correlation and Matérn covariance models to characterize the spatial correlation of the spatio-temporal process. In terms of the estimation procedure, we have proposed an approximation to the likelihood function via truncation, which provides an efficient means of calculating the MLE. Simulation studies showed that the proposed approach can be extremely efficient even for small truncation levels. Furthermore, this approach overcomes the computational burden while substantially reducing the memory required when dealing with large spatio-temporal data sets.

In addition, we used the Kalman filter algorithm to obtain the k-step-ahead predictions and to handle missing values without any additional assumptions or imputation procedures to fill in the missing values. These features provide clear advantages over alternative procedures that deal with spatio-temporal models.

An interesting direction for future research is to use state-space models in the MA\((\infty )\) expansion to incorporate non-stationarity, by introducing time-varying models and/or location-dependent processes in the observation operator \(G_{t}(\mathbf{s})\). Although this would imply a significant increase in the number of parameters to be estimated, it would require only minor changes to the algorithms presented here. Rao (2008) proposed a local least squares method to estimate the parameters of a spatio-temporal model with location-dependent parameters, which are used to describe spatial non-stationarity, and we have recently begun working on combining these two ideas.