
1 Introduction

Gaussian Processes (GPs) [15] are a powerful tool for modeling correlated observations, including time series. GPs have been used for the analysis of astronomical time series (see [4] and the references therein), forecasting of electric load [12] and analysis of correlated and irregularly-sampled time series [16].

A kernel composition specific to time series has recently been proposed [3]. It contains a linear trend, periodic patterns, and further flexible kernels for modeling the non-linear trend. By setting priors on the hyperparameters, which keep the inference within a reasonable range even on short time series, the GP yields very accurate forecasts, outperforming traditional time series models.

Note that the above GP-based model is a type of Generalised Additive Model (GAM) [26]. However, contrary to traditional GAMs, it uses different nonparametric components for the periodic and non-linear terms, and it is estimated in a fully Bayesian way (that is, without backfitting).

Yet, GPs have computational complexity \(O(n^3)\) and storage demands of \(O(n^2)\); hence, they are not suitable for large datasets. Several approximations have been proposed to reduce the computational complexity to \(O(n)\), such as sparse approximations based on inducing points [1, 6, 7, 14, 19, 20, 24], which, however, introduce additional hyperparameters.

In the case of time series, it is possible to represent the full GP as a State Space model, without the need for any additional hyperparameter [2, 11, 13, 17, 18, 22] and with O(n) complexity.

We focus on the SS representation of the GP and we provide the following contributions. We discuss how to represent the model of [3] as a SS model, obtaining almost identical results on the time series of the M3 competition.

We also apply the GP model of [3] to very long time series, thanks to the SS representation. Also in this case we obtain positive results compared to the competitors.

Moreover, once the covariance functions of the GP are represented in the SS framework, they can be combined with existing SS models. This opens up the possibility of developing novel time series models. As a proof of concept, we consider a traditional state-space model (additive exponential smoothing) and we replace its seasonal component with the SS representation of the periodic kernel of the GP. We obtain a model with fewer parameters, which has higher accuracy on the time series of the M3 competition. The resulting model is also more flexible; for instance, it could be easily extended to manage time series containing multiple seasonal patterns, unlike traditional exponential smoothing.

2 Background

In this section, we provide background on (i) Gaussian Processes; (ii) State Space models; (iii) the State Space representation of Gaussian Processes.

2.1 Gaussian Process

We consider the regression model

$$\begin{aligned} y=f(\mathbf {x})+v, \end{aligned}$$
(1)

where \(\mathbf {x} \in \mathbb {R}^p\), \(f : \mathbb {R}^p \rightarrow \mathbb {R}\) and \(v\sim N(0,s_v^2)\) is the noise. Our goal is to estimate f given the training data \(\mathcal {D}=\{(\mathbf {x}_i,y_i), ~i=1,\dots ,n\}\). In GP regression, we place a GP prior on the unknown f, \(f \sim GP(0,k_{\boldsymbol{\theta }})\),Footnote 1 and calculate the posterior distribution of f given the data \(\mathcal {D}\). We then employ this posterior to make inferences about f.

In particular, we are interested in predictive inferences. Based on the training data \(X^T=[\mathbf {x}_1,\dots ,\mathbf {x}_n]\), \(\mathbf {y}=[y_1,\dots ,y_n]^T\) , and given m test inputs \((X^*)^T=[\mathbf {x}_1^*,\dots ,\mathbf {x}_m^*]\) , we aim to find the posterior distribution of \(\mathbf {f}^*= [f(\mathbf {x}^*_1),\dots ,f(\mathbf {x}^*_m)]^T\). From (1) and the properties of the Gaussian distribution,Footnote 2 the posterior distribution of \(\mathbf {f}^*\) is Gaussian [15, Sec. 2.2]:

$$\begin{aligned} p(\mathbf {f}^*|X^*,X,\mathbf {y},{\boldsymbol{\theta }}) = N(\mathbf {f}^*; \hat{\boldsymbol{\mu }}_{\boldsymbol{\theta }}(X^*|X,\mathbf {y}),\hat{K}_{\boldsymbol{\theta }}(X^*,X^*|X)), \end{aligned}$$
(2)

with mean and covariance given by:

$$\begin{aligned} \nonumber&\hat{\boldsymbol{\mu }}_{\boldsymbol{\theta }}(\mathbf {f}^*|X,\mathbf {y})=K_{\boldsymbol{\theta }}(X^*,X)(K_{\boldsymbol{\theta }}(X,X))^{-1}\mathbf {y},\\&\hat{K}_{\boldsymbol{\theta }}(X^*,X^*|X)=K_{\boldsymbol{\theta }}(X^*,X^*)-K_{\boldsymbol{\theta }}(X^*,X)(K_{\boldsymbol{\theta }}(X,X))^{-1}K_{\boldsymbol{\theta }}(X,X^*). \end{aligned}$$
(3)
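For reference, a minimal sketch of Eq. (3) in NumPy; the Cholesky-based implementation and the explicit noise term folded into \(K(X,X)\) are our choices, not necessarily those of the original implementation:

```python
import numpy as np

def gp_posterior(K_train, K_cross, K_test, y, noise_var=1e-6):
    """Posterior mean and covariance of f* as in Eq. (3).

    K_train : (n, n) kernel matrix K(X, X)
    K_cross : (m, n) kernel matrix K(X*, X)
    K_test  : (m, m) kernel matrix K(X*, X*)
    y       : (n,) vector of training observations
    noise_var : observation-noise variance s_v^2 (folds the WN term into K(X, X))
    """
    n = K_train.shape[0]
    # Cholesky factorisation of the (noisy) training covariance
    L = np.linalg.cholesky(K_train + noise_var * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # (K + s_v^2 I)^{-1} y
    mean = K_cross @ alpha                                 # posterior mean
    V = np.linalg.solve(L, K_cross.T)                      # L^{-1} K(X, X*)
    cov = K_test - V.T @ V                                  # posterior covariance
    return mean, cov
```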

In GPs, the kernel defines the Covariance Function (CF) between any two function values: \(Cov(f(\mathbf {x}), f(\mathbf {x}^*) ) = k_{\boldsymbol{\theta }}(\mathbf {x}, \mathbf {x}^* )\). Common kernels are the White Noise (WN), the Linear (LIN), the Matern 3/2 (MAT32), the Matern 5/2 (MAT52), the Squared Exponential (RBF), the Cosine (COS) and the Periodic (PER). Hereafter, we provide the expressions of these kernels for \(p=1\), which is the case of time series; see instead [15] for generalizations:

$$\begin{aligned}&\text {WN: } \displaystyle k_{\boldsymbol{\theta }}(x_1, x_2) = s_v^2 \delta _{x_1,x_2}\\&\text {LIN: } \displaystyle k_{\boldsymbol{\theta }}(x_1, x_2) = s_b^2+ s_l^2 x_1x_2\\&\text {MAT32: } \displaystyle k_{\boldsymbol{\theta }}(x_1, x_2) = s_e^2 \left( 1 +\tfrac{\sqrt{3}|x_1-x_2|}{\ell _e}\right) \exp \left( -\tfrac{\sqrt{3}|x_1-x_2|}{\ell _e}\right) \\&\text {MAT52: } \displaystyle k_{\boldsymbol{\theta }}(x_1, x_2) = s_e^2 \left( 1 +\tfrac{\sqrt{5}|x_1-x_2|}{\ell _e}+\tfrac{5(x_1-x_2)^2}{3 \ell ^2_e} \right) \exp \left( -\tfrac{\sqrt{5}|x_1-x_2|}{\ell _e}\right) \\&\text {RBF: } \displaystyle k_{\boldsymbol{\theta }}(x_1, x_2) = s_r^2 \exp \left( -\frac{(x_1-x_2)^2}{2 \ell _r^2}\right) \\&\text {COS: } \displaystyle k_{\boldsymbol{\theta }}(x_1, x_2) = s_c^2 \cos \left( \frac{x_1-x_2}{\tau }\right) \\&\text {PER: } \displaystyle k_{\boldsymbol{\theta }}(x_1, x_2) = s_p^2 \exp \left( -\frac{2 \sin ^2(\pi |x_1-x_2|/p_e)}{\ell _p^2}\right) \end{aligned}$$

where \(\delta _{x_1,x_2}\) is the Kronecker delta, which equals one when \(x_1=x_2\) and zero otherwise. The hyperparameters are the variances \(s^2_v,s^2_l,s^2_e,s^2_r,s^2_c,s^2_p>0\), the lengthscales \(\ell _r,\ell _e,\ell _p,\tau >0\) and the period \(p_e\).
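For illustration, a few of these kernels written as plain NumPy functions on vectors of scalar inputs (function names and vectorisation are ours):

```python
import numpy as np

def k_lin(x1, x2, s_b, s_l):
    """LIN kernel: s_b^2 + s_l^2 * x1 * x2."""
    return s_b**2 + s_l**2 * np.outer(x1, x2)

def k_mat32(x1, x2, s_e, ell_e):
    """Matern 3/2 kernel with variance s_e^2 and lengthscale ell_e."""
    r = np.abs(x1[:, None] - x2[None, :])
    z = np.sqrt(3.0) * r / ell_e
    return s_e**2 * (1.0 + z) * np.exp(-z)

def k_per(x1, x2, s_p, ell_p, p_e):
    """Periodic kernel with period p_e and lengthscale ell_p."""
    r = np.abs(x1[:, None] - x2[None, :])
    return s_p**2 * np.exp(-2.0 * np.sin(np.pi * r / p_e)**2 / ell_p**2)
```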

Selecting a kernel, or a combination of kernels, to determine the structure of the covariance is a crucial factor governing the performance of a GP model. Spectral mixture (SM) kernels [25] have been devised to overcome this issue thanks to their ability to approximate any stationary kernel.Footnote 3 SM kernels define a covariance by taking the inverse Fourier transform of a weighted sum of different shifts of a probability density. In the original formulation [25], the authors considered a Gaussian PDF, resulting in a covariance kernel which is a sum of RBF\(\times \)COS kernels, so that each term in the sum is equal to:

$$\begin{aligned} \text {SM}_i\text {:}\;\displaystyle k_{\boldsymbol{\theta }}(x_1, x_2) = s_{m_i}^2 \exp \left( -\frac{(x_1-x_2)^2}{2 \ell _{m_i}^2}\right) \cos \left( \frac{x_1-x_2}{\tau _{m_i}} \right) ,\\ \end{aligned}$$

with hyperparameters \(s_{m_i}\), \(\ell _{m_i}\) and \(\tau _{m_i}\).

Learning the Hyperparameters. We denote by \(\boldsymbol{\theta }\) the vector containing all the kernels’ hyperparameters. In practical applications of GPs, \(\boldsymbol{\theta }\) has to be selected. We use Bayesian model selection to consistently set these parameters. Variances and lengthscales are non-negative hyperparameters, to which we assign log-normal priors (later we show how we define the priors). We then compute the maximum a-posteriori (MAP) estimate of \(\boldsymbol{\theta }\), that is, we maximize w.r.t. \(\boldsymbol{\theta }\) the joint marginal probability \(p(\mathbf {y},\boldsymbol{\theta })\), which is the product of the prior \(p(\boldsymbol{\theta })\) and the marginal likelihood [15, Ch. 2]:

$$\begin{aligned} p(\mathbf {y}|X,\boldsymbol{\theta })=N(\mathbf {y};0, K_{\boldsymbol{\theta }}(X,X)). \end{aligned}$$
(4)

Usually \(\boldsymbol{\theta }\) is selected by maximizing the marginal likelihood of Eq. (4). Yet, better estimates can be obtained by assigning priors to the hyperparameters and then performing MAP estimation. The MAP approach yields reliable estimates also on short time series, as pointed out in [3], which also proposes a methodology for defining such priors.
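A minimal sketch of the resulting MAP objective, assuming lognormal priors on the (positive) hyperparameters and a generic kernel function; the function names and the generic `kernel` argument are illustrative, not the actual implementation of [3]:

```python
import numpy as np
from scipy.stats import lognorm

def neg_log_marginal_likelihood(theta, X, y, kernel):
    """-log N(y; 0, K_theta(X, X)), cf. Eq. (4)."""
    K = kernel(X, X, theta)
    n = len(y)
    L = np.linalg.cholesky(K + 1e-9 * np.eye(n))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return 0.5 * y @ alpha + np.log(np.diag(L)).sum() + 0.5 * n * np.log(2 * np.pi)

def neg_log_posterior(theta, X, y, kernel, prior_mu, prior_sigma):
    """MAP objective: -log p(y | X, theta) - log p(theta), with lognormal priors."""
    nll = neg_log_marginal_likelihood(theta, X, y, kernel)
    log_prior = lognorm.logpdf(theta, s=prior_sigma, scale=np.exp(prior_mu)).sum()
    return nll - log_prior
```

The MAP estimate is then obtained by minimizing `neg_log_posterior` with a gradient-based optimizer (e.g., `scipy.optimize.minimize`) over the positive hyperparameters.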

2.2 State Space Models

Consider the following stochastic continuous-time linear time-variant (LTV) State Space (SS) model [10]

$$\begin{aligned} \left\{ \begin{array}{rcl} d \mathbf {f}(t)&{}=&{}\mathbf {F}(t)\,\mathbf {f}(t)dt+\mathbf {L}(t)\,dw(t),\\ y(t_k)&{}=&{}\mathbf {C}(t_k)\,\mathbf {f}(t_k), \end{array}\right. \end{aligned}$$
(5)

where \(\mathbf {f}(t)=[f_1(t),\dots ,f_m(t)]^T\) is the state vector,Footnote 4 \(y(t_k)\) is the measurement at time \(t_k\), \(\mathbf {F}(t),\mathbf {C}(t),\mathbf {L}(t)\) are known matrices of appropriate dimensions and w(t) is a one-dimensional Wiener noise process with intensity q(t). We further assume that the initial state \(\mathbf {f}(t_0)\) and w(t) are independent for each \(t\ge t_0\). The solution of the stochastic differential equation in (5) is [10]:

$$\begin{aligned} \mathbf {f}(t_k)=\boldsymbol{\psi }(t_k,t_0)\,\mathbf {f}(t_0)+\int \limits _{t_0}^{t_k} \boldsymbol{\psi }(t_k,\tau )\mathbf {L}(\tau )\,dw(\tau ), \end{aligned}$$
(6)

where \(\boldsymbol{\psi }(t_k,t_0)= \exp (\int _{t_0}^{t_k} \mathbf {F}(t)dt)\) is the state transition matrix, computed as a matrix exponential.Footnote 5 Assuming that \(E[\mathbf {f}(t_0)]=\mathbf {0}\), it can easily be proven that the vector of observations \([y(t_1),y(t_2),\dots ,y(t_n)]^T\) is Gaussian distributed with zero mean and covariance matrix whose elements are given by:

$$\begin{aligned} \begin{array}{rl} E[y(t_i)y(t_j)] &{}=\mathbf {C}(t_i)\boldsymbol{\psi }(t_i,t_0)E[\mathbf {f}(t_0)\mathbf {f}^T(t_0)] (\mathbf {C}(t_j)\boldsymbol{\psi }(t_j,t_0))^T\\ &{}+\int \limits _{t_0}^{\min (t_i,t_j)} h(t_i,u)h(t_j,u)q(u)du \end{array} \end{aligned}$$
(7)

where we have exploited the fact that \(E[dw(u)dw(v)]=q(u)\delta (u-v)dudv\) [10] and defined \(h(t_1,t_2)=\mathbf {C}(t_1) \boldsymbol{\psi }(t_1,t_2)\mathbf {L}(t_2)\).

In SS models, one aims to estimate the states \(\mathbf {f}(t_1),\dots ,\mathbf {f}(t_n)\) given the observations \(y(t_1),\dots ,y(t_n)\) and the initial condition. There are two problems of particular interest: (i) filtering, whose aim is to compute \(p(\mathbf {f}(t_k)|y(t_1),\dots ,y(t_k))\) for every \(t_k\); (ii) smoothing, whose aim is to compute \(p(\mathbf {f}(t_k)|y(t_1),\dots ,y(t_n))\) for every \(t_k\). For stochastic LTV systems, filtering and smoothing can be solved exactly using the Kalman Filter (KF) and the Rauch-Tung-Striebel smoother [10] with complexity \(\mathcal {O}(n)\).
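For reference, a minimal sketch of the Kalman filter recursion for a discrete-time SS model with scalar observations (variable names are ours); it also accumulates the log marginal likelihood, which is used later for MAP estimation:

```python
import numpy as np

def kalman_filter(y, Phi, Q, C, R, m0, P0):
    """Kalman filter for f_k = Phi f_{k-1} + nu_k,  y_k = C f_k + r_k.

    y : (n,) observations; Phi, Q : transition matrix / process-noise covariance;
    C : (1, d) measurement matrix; R : measurement-noise variance;
    m0, P0 : mean and covariance of the initial state.
    Returns filtered means, covariances and the log marginal likelihood.
    """
    m, P, loglik = m0, P0, 0.0
    means, covs = [], []
    for yk in y:
        # predict
        m = Phi @ m
        P = Phi @ P @ Phi.T + Q
        # update
        S = C @ P @ C.T + R                     # innovation variance
        K = P @ C.T / S                          # Kalman gain (scalar observation)
        v = yk - C @ m                           # innovation
        loglik += -0.5 * (np.log(2 * np.pi * S) + v**2 / S).item()
        m = m + (K * v).ravel()
        P = P - K @ C @ P
        means.append(m); covs.append(P)
    return np.array(means), np.array(covs), loglik
```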

2.3 SS Models Representation of GPs

When the GP has a one-dimensional input, it is possible to represent (or approximate) the GP with a SS model. The advantage of the SS representation is that estimates and inferences can be computed with complexity \(\mathcal {O}(n)\). In practice, one has to find a SS model whose covariance matrix (7) coincides with (or approximates) that of the GP. This provides the SS representation of the GP, which then allows us to estimate \(\mathbf {f}(t_k)\) given data \(\{y(t_1),\dots ,y(t_n)\}\) using the KF and the Rauch-Tung-Striebel smoother (with complexity \(\mathcal {O}(n)\)). This is done as follows:

  1. Discretize the continuous-time SS model to obtain a discrete-time SS model (this step essentially consists in applying (6); see the sketch after this list):

    $$\begin{aligned} \left\{ \begin{array}{rcl} \mathbf {f}(t_k)&{}=&{}\boldsymbol{\psi }(t_k,t_{k-1})\mathbf {f}(t_{k-1})+\boldsymbol{\nu }(t_{k-1}),\\ y(t_k)&{}=&{}\mathbf {C}(t_k)\,\mathbf {f}(t_k), \end{array}\right. \end{aligned}$$
    (8)

    where \(\boldsymbol{\nu }(t_{k-1})=\int _{t_{k-1}}^{t_k} \boldsymbol{\psi }(t_k,\tau )\mathbf {L}\,dw(\tau )\).

  2. Compute the probability density function (PDF) \(p(\mathbf {f}(t_k)|y(t_1),\dots ,y(t_k))\), which is Gaussian. The mean and covariance matrix of this Gaussian PDF can be computed efficiently by using the KF.

  3. Compute the Gaussian posterior PDF \(p(\mathbf {f}(t_k)|y(t_1),\dots ,y(t_n))\); the mean and covariance matrix of this PDF can be computed very efficiently by using the Rauch-Tung-Striebel smoother. This step returns the estimates of the state given all observations.

  4. To estimate the hyperparameters of the CF, we can perform MAP estimation (as for GPs). Note that the marginal likelihood of the SS model can be computed efficiently by the KF.
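The following is a minimal numerical sketch of the discretization in step 1, for a time-invariant \(\mathbf {F}\) and equally spaced observations; the matrix-fraction construction of the process-noise covariance is a standard trick, and the function name is ours:

```python
import numpy as np
from scipy.linalg import expm

def discretize(F, L, q, dt):
    """Discretize df = F f dt + L dw (spectral density q) over a step dt.

    Returns the transition matrix Phi = exp(F dt) and the covariance Q of the
    discrete-time noise nu_k, computed via the matrix-fraction decomposition.
    """
    d = F.shape[0]
    Phi = expm(F * dt)
    # Build the block matrix [[F, L q L^T], [0, -F^T]] and exponentiate it
    M = np.zeros((2 * d, 2 * d))
    M[:d, :d] = F
    M[:d, d:] = L @ (q * L.T)
    M[d:, d:] = -F.T
    E = expm(M * dt)
    Q = E[:d, d:] @ Phi.T   # Q = int_0^dt exp(F s) L q L^T exp(F s)^T ds
    return Phi, Q
```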

State Space Representation of Covariance Functions. The continuous-time SS representation of the covariance functions of Sect. 2.1 is given in Table 1. These representations do not include the variance scaling parameter that multiplies the CF; it can, however, be included in the SS model by rescaling either the stochastic forcing term or the initial condition (for SS models without a forcing term).

Table 1. SS representation of the CFs. When the distribution of the initial state is not provided, it is assumed to be equal to zero. The intensity of the Wiener process w is assumed to be \(q=1\).

Representing Compositions of Covariance Functions. Additive combination of covariance functions can be represented by stacking SS models; this is called cascade composition [17]. For instance, the SS model corresponding to WN+LIN is:

$$\begin{aligned} \left\{ \begin{array}{rcl} \frac{d f_1}{dt}(t)&{}=&{}\frac{dw}{dt}(t)\\ \frac{d f_2}{dt}(t)&{}=&{}f_3(t)\\ \frac{d f_3}{dt}(t)&{}=&{}0 \\ y(t_k)&{}=&{}f_1(t_k)+f_2(t_k) \end{array}\right. ~\begin{bmatrix}f_2(t_0)\\ f_3(t_0)\end{bmatrix} \sim \mathcal {N}\left( \begin{bmatrix}0\\ 0\end{bmatrix},\begin{bmatrix}s_b^2 &{} 0\\ 0 &{}s_l^2\end{bmatrix}\right) . \end{aligned}$$
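For illustration, additive (cascade) composition amounts to stacking the component models block-diagonally; a minimal sketch, assuming time-invariant components (function name and example matrices are ours):

```python
import numpy as np
from scipy.linalg import block_diag

def stack_additive(models):
    """Additive composition of SS models given as (F, L, C) triples.

    The joint state is the concatenation of the component states, so F and L
    are block-diagonal and the measurement matrices are concatenated:
    y = C_1 f_1 + C_2 f_2 + ...
    """
    Fs, Ls, Cs = zip(*models)
    F = block_diag(*Fs)
    L = block_diag(*Ls)
    C = np.hstack(Cs)
    return F, L, C

# Example: WN + LIN, matching the equation above.
F_wn, L_wn, C_wn = np.zeros((1, 1)), np.eye(1), np.ones((1, 1))      # d f1/dt = dw/dt
F_lin = np.array([[0.0, 1.0], [0.0, 0.0]])                            # d f2/dt = f3, d f3/dt = 0
L_lin, C_lin = np.zeros((2, 1)), np.array([[1.0, 0.0]])
F, L, C = stack_additive([(F_wn, L_wn, C_wn), (F_lin, L_lin, C_lin)])
```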

Multiplicative composition of covariance functions can be obtained via parallel composition [17] of SS models. For instance, the COS \(\times \) MAT32 kernel is represented as:

$$\begin{aligned} \left\{ \begin{array}{rcl} \frac{d f_1}{dt}(t)&{}=&{} \omega f_2(t) + f_3(t) \\ \frac{d f_2}{dt}(t)&{}=&{} -\omega f_1(t) + f_4(t)\\ \frac{d f_3}{dt}(t)&{}=&{} -\tfrac{3}{\ell ^2}f_1(t) - \tfrac{2\sqrt{3}}{\ell }f_3(t)+\omega f_4(t)+ \tfrac{12\sqrt{3}}{\ell ^3}\frac{dw_1}{dt}(t)\\ \frac{d f_4}{dt}(t)&{}=&{} -\tfrac{3}{\ell ^2}f_2(t) -\omega f_3(t) - \tfrac{2\sqrt{3}}{\ell }f_4(t)+\tfrac{12\sqrt{3}}{\ell ^3}\frac{dw_2}{dt}(t)\\ y(t_k)&{}=&{}f_1(t_k) \end{array}\right. \end{aligned}$$

The RBF and PER kernels do not admit an exact SS representation; for this reason, they are not shown in Table 1. However, an approximate SS representation can be given. The PER kernel can be approximated as the sum of cosine covariance functions (COS + COS + ... + COS), with a suitable choice of their frequencies and variances (defined using a Fourier series expansion of the PER kernel) [21]. In this paper, we use 7 COS terms to approximate the PER kernel. The RBF kernel can be approximated by a SS model based on the Matern d/2 kernel, where \(d=1,3,5,7,9,\dots \), and the approximation improves as d increases. In this paper, we use \(d=3\).
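As an illustration of the periodic-kernel approximation, the following sketch computes the variances and angular frequencies of the cosine terms from the Fourier expansion of the PER kernel; the expansion in terms of modified Bessel functions is standard (cf. [21]), while the truncation, the indexing (which includes a zero-frequency term) and the function name are ours:

```python
import numpy as np
from scipy.special import ive  # exponentially scaled Bessel function: ive(j, z) = I_j(z) * exp(-z)

def per_as_cosines(s_p, ell_p, p_e, n_terms=7):
    """Approximate the PER kernel by a truncated sum of COS terms.

    Returns, for j = 0..n_terms-1, the variance q2[j] and angular frequency
    omega[j] of each term, so that k(r) ~= sum_j q2[j] * cos(omega[j] * r).
    """
    z = 1.0 / ell_p**2
    j = np.arange(n_terms)
    q2 = s_p**2 * np.where(j == 0, 1.0, 2.0) * ive(j, z)
    omega = 2.0 * np.pi * j / p_e
    return q2, omega

# Sanity check of the expansion (assumed setup): compare with the exact PER kernel.
q2, omega = per_as_cosines(s_p=1.0, ell_p=1.0, p_e=12.0, n_terms=7)
r = np.linspace(0.0, 12.0, 5)
approx = (q2[:, None] * np.cos(omega[:, None] * r)).sum(axis=0)
exact = np.exp(-2.0 * np.sin(np.pi * r / 12.0)**2)
```

Each term with \(j \ge 1\) corresponds to a COS covariance function (a two-dimensional SS model), so the truncated expansion is directly representable in state-space form.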

2.4 Time Series Forecasting and Priors

In [3], GP regression was proposed for time series forecasting using the following composite kernel:

$$\begin{aligned} \text {K = PER + LIN + RBF + SM}_1 \text { + SM}_2 \text { + WN.} \end{aligned}$$
(9)

The periodic kernel (PER) captures the seasonality of the time series. LIN captures the linear trend. Long-term trends are generally smooth, and can be properly modelled by the RBF kernel. The two SM kernels are used to pick up the remaining signal. Finally, the WN kernel represents the observation (Gaussian) noise.

Table 2. Parameters of the lognormal priors. The same prior is adopted for the variances of all components in Eq. (9)

This results in a kernel capturing a wide range of patterns but comprising 16 hyperparameters, which must be estimated from data. This might be challenging on short time series, such as monthly or quarterly ones. In [3] the problem is addressed by setting priors on the hyperparameters. In particular, lognormal priors are adopted and they are defined through a hierarchical Bayes approach, i.e., by analyzing a subset of monthly time series from the M3 competition. The priors, which we also adopt, are given in Table 2.

Fig. 1. Comparison of GP and SS forecasts. The blue dots are the training data and the purple dots the test data. The small differences between full GP and SS are due to the slightly different estimation of the hyperparameters. The time series are monthly and the forecasts are computed up to 1.5 years ahead; time is expressed in years.

2.5 SS Approximation

To achieve \(O(n)\) complexity, we replace the kernel in (9) with the following approximation:

$$\begin{aligned} \tilde{K} \text { = (+}_{7}\text {COS) + LIN + MAT32 + COS } \times \text { MAT32 + COS } \times \text { MAT32 + WN.} \end{aligned}$$
(10)

Note that we have approximated PER with the sum of 7 COS kernels, and RBF with MAT32.Footnote 6 A GP with the above kernel can equivalently be represented by a SS model whose state has dimension \(7\times 2+2+2+4+4+1=27\).

Figure 1 compares the GP estimate and forecast based on the kernel (9) with the SS approximation based on the kernel (10), on some time series from the M3 competition.Footnote 7 The SS approximation provides forecasts close to those of the full GP. We provide a more in-depth analysis when discussing the experiments.

2.6 Combining GP Kernel with Exponential Smoothing

Our framework is flexible enough to allow combining the state-space representations of covariance functions with existing state-space models, thus obtaining novel time series models.

As a proof of concept, we consider state-space additive exponential smoothing (additive ets), and we replace its seasonal component with the PER kernel.

The discrete-time SS representation of exponential smoothing with linear trend is [8]:

$$\begin{aligned} \text {Holt:} \left\{ \begin{array}{rcl} f_1((k+1)\varDelta _t)&{}=&{} f_1(k\varDelta _t)+f_2(k\varDelta _t)+\alpha w((k+1)\varDelta _t) \\ f_2((k+1)\varDelta _t)&{}=&{} f_2(k\varDelta _t)+\alpha \beta w((k+1)\varDelta _t) \\ y((k+1)\varDelta _t)&{}=&{}f_1(k\varDelta _t)+f_2(k\varDelta _t) + w((k+1)\varDelta _t) \end{array}\right. \begin{bmatrix}f_1(t_0)\\ f_2(t_0)\end{bmatrix} \sim \mathcal {N}\left( \begin{bmatrix}0\\ 0\end{bmatrix},\begin{bmatrix}s_l^2 &{} 0\\ 0 &{}s_b^2 \end{bmatrix}\right) \end{aligned}$$

where \(\varDelta _t\) is the sampling interval and the w are independent Gaussian noises with zero mean and variance \(s_v^2\). This model has five parameters: \(\alpha ,\beta \in [0,1]\) and \(s_l^2,s_b^2,s_v^2\).
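For concreteness, a minimal sketch of the Holt component above written in matrix form (an innovations state-space model; the function name is ours):

```python
import numpy as np

def holt_innovations_ss(alpha, beta, s_v):
    """Matrices of the innovations-form Holt model above.

    State f = [level, trend]; the same noise w drives both the state update
    and the observation:  f_{k+1} = A f_k + g w_{k+1},  y_{k+1} = c f_k + w_{k+1}.
    """
    A = np.array([[1.0, 1.0],
                  [0.0, 1.0]])      # level_{k+1} = level_k + trend_k; trend_{k+1} = trend_k
    g = np.array([[alpha],
                  [alpha * beta]])  # noise loading on the state
    c = np.array([[1.0, 1.0]])      # y_{k+1} = level_k + trend_k + w_{k+1}
    return A, g, c, s_v**2
```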

We then complete the SS model by adding the (approximate) SS representation of the PER kernel, constituted by the sum of seven COS covariance functions. When estimating the hyperparameters, automatic relevance determination (ARD) automatically down-weights the unnecessary components, without the need for a separate model selection step.Footnote 8

3 Experiments

We consider the following GP models:

  • full-GP: the model of Eq. (9), trained with priors [3];

  • full-GP\(_0\): the same model, trained by maximizing the marginal likelihood (no priors);

  • SS-GP and SS-GP\(_0\), i.e., the corresponding SS models (Eq. 10) trained with and without priors.

We use a single restart when training all the models.

As benchmarks, we consider auto.arima and ets, both available from the forecast package [9]. The auto.arima algorithm first makes the time series stationary via differencing; then it fits an ARMA model, selecting the orders via AICc. The ets algorithm fits several state-space exponential smoothing models [8], characterized by different types of trend, seasonality and noise; the best model is eventually chosen via AICc. All the considered models represent the forecast uncertainty via a Gaussian distribution.

Metrics. As performance metric, we consider the mean absolute error (MAE) on the test set:

$$\begin{aligned} \text {MAE}= \frac{1}{T}\sum _{t=1}^{T} |y_t - \hat{y}_t|, \end{aligned}$$

where we denote by \(y_t\) and \(\hat{y}_t\) the actual value and the expected value of the time series at time t; \(\sigma ^2_t\) denotes the variance of the forecast at time t, and \(\text {T}\) the length of the test set.

Furthermore, we compute the continuous ranked probability score (CRPS) [5], which generalizes the MAE to probabilistic forecasts. It is a proper scoring rule for probabilistic forecasts, corresponding to the integral of the Brier scores over the continuous predictive distribution. MAE and CRPS are loss functions, hence the lower the better.
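Since all the considered models produce Gaussian predictive distributions, the CRPS can be computed in closed form; a minimal sketch of both metrics (the closed-form Gaussian CRPS expression is standard, see e.g. [5]; function names are ours):

```python
import numpy as np
from scipy.stats import norm

def mae(y, y_hat):
    """Mean absolute error over the test set."""
    return np.mean(np.abs(y - y_hat))

def crps_gaussian(y, mu, sigma):
    """Average CRPS of Gaussian forecasts N(mu, sigma^2) against observations y."""
    z = (y - mu) / sigma
    crps = sigma * (z * (2 * norm.cdf(z) - 1) + 2 * norm.pdf(z) - 1 / np.sqrt(np.pi))
    return np.mean(crps)
```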

3.1 Monthly M3

Table 3. Performance on the M3 monthly time series.

The M3 competition includes 1429 monthly time series. We exclude 350 of them, namely those used in [3] to define the priors of Table 2 (which we also adopt). We thus run experiments on the remaining 1079 monthly time series. The length of the training set varies between 49 and 126 months, while the test set is always 18 months long. We standardize each time series using the mean and the standard deviation of the training set. We fix the period of the periodic kernel to one year, which is standard practice for M3.

The median and mean results across time series are given in Table 3. SS-GP and full-GP obtain the best median and mean performance on all indicators. The performance of full-GP and of its state-space representation is practically identical, showing that the SS approximation is very accurate. We also tried Prophet [23], but its accuracy was not competitive, so we dropped it.

The large improvement of full-GP and SS-GP over full-GP\(_0\) and SS-GP\(_0\) confirms that the priors are necessary to exploit the potential of the GP.

3.2 Combining GP Kernel and Exponential Smoothing

The SS representation of GPs allows us to combine GPs with state-of-the-art models for time series forecasting, such as the ETS model [8].

In this section, we consider the SS model discussed previously, which uses the following kernel:

$$\begin{aligned} \tilde{K}_1 = (+_{7}\text {COS) + Holt}, \end{aligned}$$
(11)

where the Holt component has been defined in Sect. 2.6.

We compare this model with the additive ets model, defined as follows. The additive ets model fits four different models via maximum likelihood and performs model selection via AICc. The four models are: simple exponential smoothing (ses, no trend and no seasonality), ses with linear trend, ses with linear trend and additive seasonality, and ses with additive seasonality but no trend. We implement all these models using the forecast package for R [9]. The ets framework also provides multiplicative models, which, however, we do not consider in this section.

The seasonal component of exponential smoothing has some shortcomings: it requires estimating \(m+1\) parameters, where m denotes the number of samples within a period (e.g., \(m=12\) for monthly time series); moreover, it does not handle complex seasonalities such as non-integer periods or multiple seasonal patterns. In our model we thus replace it with the PER kernel (equivalently, the (\(+_{7}\)COS) kernel), which has only two (hyper)parameters and which can model complex seasonalities (e.g., multiple seasonalities can be modelled by adding multiple PER kernels).

The main differences between additive ets and our novel model are thus:

  • PER kernel vs seasonal component of exponential smoothing;

  • automatic relevance determination vs model selection.

The simulation results are shown in Table 4. SS-GP is again the best model. Comparing the SS-GP performance in Tables 3 and 4, it is evident that the more complex kernel (10) provides better performance. Nevertheless, this shows how the SS representation of GPs opens up the possibility of developing novel time series models, combining traditional time series models with “machine-learning-like” models.

Table 4. Performance on M3 monthly. SS-GP with kernel \(\tilde{K}_1\) compared to additive ETS.

3.3 Large Datasets and Multiple Seasonality

In contrast to the full GP, SS models can scale to large datasets. We provide a proof of concept by applying the SS model to two time series from the UCI Electricity Dataset. Each time series records the electricity consumption of a client from 2011 to 2014, at 15-minute intervals. The goal is to forecast the electricity consumption one week ahead. The length of each time series is 23997 and, therefore, we cannot run the full GP (on a standard PC). Moreover, the time series have both daily and weekly periodicity, which means the kernel in (10) is not appropriate.

Fig. 2. Two time series taken from the Electricity Dataset.

Fig. 3. One week ahead forecast computed by (i) the proposed SS model; (ii) Facebook’s Prophet; for the two time series in Fig. 2. The time has been normalized: 1 is one year.

However, we can easily deal with multiple seasonality by adding another periodic component to the kernel:

$$\begin{aligned} \tilde{K} = (+_{7}\text {COS})+(+_{7}\text {COS}) \text { + LIN + MAT32 + COS } \times \text { MAT32 + COS } \times \text { MAT32 + WN} \end{aligned}$$
(12)

where the first periodic kernel (the term \((+_{7}\text {COS})\)) has period 1/365.25 and the second 7/365.25.Footnote 9

Figure 2 shows two time series taken from the Electricity Dataset. Figure 3 reports the corresponding one week ahead forecasts computed by (i) the proposed SS model; (ii) Facebook’s Prophet. The training time is a few seconds for Prophet and about 300 s for the SS model.

While our implementation is currently slower than Prophet, it already handles these time series flawlessly. The training time of our implementation can be largely reduced by using Stochastic Gradient Descent (SGD) optimization, thus working with minibatches of data. The forecasts show that the SS model is competitive also on long time series; however, the analysis of a large number of time series is needed in order to draw significant conclusions. We defer this analysis to future work, after the completion of a faster implementation of SS-GP based on SGD.

4 Conclusions

Focusing on time series forecasting, we have shown that a Gaussian Process with a complex composite kernel can be accurately approximated by a state space model. The resulting state space model has comparable performance, but with a complexity which scales linearly in the number of observations. Moreover, given that state-of-the-art models for time series forecasting are implemented in state space form, the state space representation of Gaussian Processes allowed us to combine traditional models (like exponential smoothing) with kernel-based components (like the periodic kernel) in a sound Bayesian manner.

Several future research directions are possible. One is the extension to time series characterized by non-Gaussian likelihoods, such as count time series or intermittent time series. Other possibilities include the combination of exponential smoothing with the spectral mixture or the Neural Network kernel. We also plan to compare our approach with other Generalised Additive (Mixture) Models used for time-series forecasting.