1 Introduction

The problem of forecasting local solar output in the short term is of significant interest for the purpose of distributed grid control and household energy management (Voyant et al. 2017; Widén et al. 2015). Variation in output is driven by two principal factors: diurnal cyclical effects (variation due to sun angle and distance) and variability due to weather effects, both inducing spatially-related dependence between proximate sites. In general, correlations across sites depend on many particulars relating to system configuration, local environment and so on. As such we wish to exploit spatial dependencies (and potentially other site-specific covariates) between sites in a flexible manner. More importantly, inherent to this application is the need for modeling uncertainty in a flexible and principled way (Antonanzas et al. 2016).

Gaussian process (gp) models are a flexible nonparametric Bayesian approach that can be applied to various problems such as regression and classification (Rasmussen and Williams 2006) and have been extended to numerous multivariate and multi-task problems including spatial and spatio-temporal contexts (Cressie and Wikle 2011). Multi-task gp methods have been developed along several lines (see e.g. Álvarez et al. 2012, for a review). Of relevance here are various mixing approaches that combine multiple latent univariate Gaussian processes via linear or nonlinear mixing to predict multiple related tasks (Wilson et al. 2012). The challenge in multi-task cases is maintaining scalability of the approach. To this end, both scalable inference methods and model constraints have been employed (Álvarez et al. 2010; Matthews et al. 2017; Krauth et al. 2017). In particular, latent Gaussian processes are generally constrained to be statistically independent (Wilson et al. 2012; Krauth et al. 2017).

In this paper we consider the case where the statistical independence constraint is relaxed for subsets of latent functions. We build on the scalable generic inference method of Krauth et al. (2017) to extend the model of Wilson et al. (2012) and allow nonzero covariance between arbitrary subsets, or ‘groups’ of latent functions. The grouping structure is flexible and can be tailored to applications of interest, and additional features can potentially be incorporated to govern input-dependent covariance across functions. By adopting separable kernel specifications, we maintain scalability of the approach whilst capturing latent dependence structures.

With this new multi-task gp model, we consider the specific challenge of forecasting the power output of multiple, distributed residential solar sites. We apply our approach to capture spatial covariance between sites by explicitly incorporating spatial dependency between latent functions, and test our method on three datasets comprising solar output at distributed household sites in Australia.

For many of the same reasons, short term wind power forecasting presents significant challenges yet is critical to emerging energy technologies (Widén et al. 2015). Output variability is driven by wind speed, which (as for solar) is driven by multiple interacting environmental factors giving rise to spatial dependencies a priori. To demonstrate the broader applicability of the model, we also illustrate our approach on a wind speed dataset comprising ground wind speed measurements at distributed weather stations in Australia.

Our results show that, for solar models, introducing spatial covariance over groups of latent functions maintains or improves point-prediction forecast accuracy relative to competing benchmark methods and at the same time provides better quantification of predictive uncertainties. Further, wind forecast accuracy and uncertainty quantification are improved on all measures by the introduction of spatial covariance. Timed experiments show that the new model dominates the equivalent model without spatial dependencies, achieving similar or superior forecast accuracy in a shorter time.

2 Related work

Gaussian processes have been considered in the multi-task setting via a number of approaches. Several methods linearly combine multiple univariate gp models via coefficients that may be parameters, as in latent factor models (Teh et al. 2005) and linear coregional models (lcm; Goovaerts 1997), or themselves input-dependent (Wilson et al. 2012).

Most mixing approaches focus on methods to combine multiple underlying independent latent functions. Recent developments in inference for multi-task gp models have improved the scalability of mixing approaches, building upon the variational framework of Titsias (2009). Nguyen and Bonilla (2014) develop a generic variational inference method that allows efficient optimization for multi-task models with arbitrary likelihoods, while the sparse, variational framework of Hensman et al. (2015) and Matthews et al. (2017) supported significant gains in the scalability of multi-task gp models. Dezfouli and Bonilla (2015) extend the approach of Nguyen and Bonilla (2014) to the sparse variational context, exploiting inducing points to improve the scalability of the inference method using a general mixture-of-Gaussians sparse, variational posterior. More recently, the approach of Dezfouli and Bonilla (2015) was extended to integrate optimization that exploits a leave-one-out objective in addition to the sparse, variational lower bound (Bonilla et al. 2016; Krauth et al. 2017).

Other multi-task gp approaches allow task-specific predictions through use of task-specific features or ‘free-form’ cross-task covariances (Bonilla et al. 2008), and more recently priors placed over cluster allocations allowing cluster-specific covariances (Hensman et al. 2014; Gardner et al. 2018). Combination via convolutions has also been developed and extended to sparse, variational settings (Álvarez and Lawrence 2009, 2011).

Coupling between the \(Q\) node (but not weight) latent functions directly is considered by Remes et al. (2017), who build upon the Gaussian process regression network (gprn) framework of Wilson et al. (2012). The authors propose a rich, generalized Wishart–Gibbs kernel that characterizes covariance between latent functions. The fully-coupled kernel is internally parameterized rather than utilizing feature-dependent cross-function covariance, and the model is approximated via variational inference. Unlike our method, however, such an approach presents significant computational challenges in scaling to a large number of observations and tasks, primarily because its variational inference requires batch optimization with \({\mathcal {O}}((NQ)^3)\) complexity, rendering it infeasible for larger scale applications. Indeed, only small experiments were carried out in Remes et al. (2017), with \(NQ\) on the order of 100 to 500, since the approach is primarily developed for small-data problems requiring a highly expressive latent covariance structure.

2.1 Multi-task solar power forecasting

A number of studies have confirmed that multi-task learning approaches can be useful for distributed solar irradiance or solar power forecasting, finding that cross-site information is relevant (Yang et al. 2013, 2015). Several studies build on the early work of Sampson and Guttorp (1992) and consider kriging methods for distributed solar irradiance forecasting or spatial prediction (notably Yang et al. 2013; Shinozaki et al. 2016). Other approaches include a range of linear statistical methods, shown to be competitive at shorter horizons, and neural network methods (Inman et al. 2013; Widén et al. 2015; Voyant et al. 2017). These approaches are generally constrained by data requirements, notably pre-flattening of data to remove diurnal cyclical trends, which requires knowledge of local system and environment variables. In the context of small scale, distributed residential sites, such information is often unavailable, motivating approaches that do not rely on rich data history or feature sets as are typically required by current approaches (Inman et al. 2013; Widén et al. 2015; Voyant et al. 2017; Antonanzas et al. 2016; Yang et al. 2018).

In addition to kriging studies, gp models have been considered for short term solar forecasting (Bilionis et al. 2014; Dahl and Bonilla 2017). Earlier approaches are generally constrained to small-data problems by poor scalability of exact gp models. More recently, Dahl and Bonilla (2017) use scalable sparse, variational inference to apply multi-task Gaussian (mtg) and linear coregional models (lcm) to forecast solar output at multiple, distributed residential sites. Multi-task approaches are found to improve model performance in mixed weather conditions, less so in sunny conditions. The specifications adopted, however, did not show strong improvement in overall forecast accuracy relative to the naive, univariate site gp benchmarks, with the lcm performing significantly worse than mtg and individual models in that setting. Moreover, the mtg presents scalability challenges since inducing inputs in the sparse, variational framework adopted are shared across all observations and tasks.

3 Multi-task Gaussian process models

A Gaussian process (gp, Rasmussen and Williams 2006) is a distribution over functions: \(f( \mathbf {x} ) \sim \mathcal {GP}(\mu ( \mathbf {x} ), \kappa ( \mathbf {x} , \mathbf {x} ^\prime ) )\) is a Gaussian process with mean function \(\mu ( \mathbf {x} )\) and covariance function \(\kappa ( \mathbf {x} , \mathbf {x} ^\prime )\) iff any finite subset of function values \(f( \mathbf {x} _1), f( \mathbf {x} _2), \ldots , f({ \mathbf {x} _N})\) follows a Gaussian distribution with mean \(\varvec{ \mu } \) and covariance \(\mathbf {K}\), obtained by evaluating the mean function and covariance function at the input points \(\mathbf {X}= \{ \mathbf {x} _1, \ldots , \mathbf {x} _N \}\).

Standard single-task gp regression assumes the observations are the latent function values corrupted by iid Gaussian noise. In this case, posterior inference can be carried out analytically (Rasmussen and Williams 2006, chap. 2). In this paper we consider the more general case of multiple outputs, which is sometimes referred to in the literature as multi-task gp regression. In other words, we are given data of the form \({\mathcal {D}}= \{\mathbf {X}\in {\mathbb {R}}^{N\times D}, \mathbf {Y}\in {\mathbb {R}}^{N\times P} \}\), where each \( \mathbf {x} _{(n)}\) in \(\mathbf {X}\) is a \(D\)-dimensional vector of input features and each \( \mathbf {y} _{(n)}\) in \(\mathbf {Y}\) is a \(P\)-dimensional vector of task outputs. Furthermore, we are interested in the case of general non-linear (non-Gaussian) likelihoods, for which there is no analytically tractable posterior.

3.1 Latent Gaussian process models with independent priors

Fortunately, advances in variational inference (Kingma and Welling 2014; Rezende et al. 2014) have allowed the development of efficient posterior inference methods with ‘black-box’ likelihoods. In the case of models with gp priors, Krauth et al. (2017) have extended these results to modeling multiple outputs under non-linear likelihoods and independent gp priors over multiple latent functions. In short, under such a modeling framework, correlations between the \(P\) outputs can be encoded via the likelihood using \(Q\) independent latent functions \(\{f_j( \mathbf {x} )\}_{j=1}^Q\), each drawn from a zero-mean gp prior, i.e. \(f_j(\mathbf{x}) \sim \mathcal {GP}( \mathbf {0} , \kappa _{j}( \mathbf {x} , \mathbf {x} '; \varvec{ \theta } _j ))\). As we shall see in Sect. 5, an example of this is when the independent gps are linearly combined via a set of weights, which can be deterministic as in the semi-parametric latent factor model of Teh et al. (2005) or stochastic and input-dependent as in the gprn of Wilson et al. (2012).

Therefore, within the framework in Krauth et al. (2017) the prior over the latent function values corresponding to the \(N\) observations along with the likelihood model is given by:

$$\begin{aligned} p(\mathbf {F} | \varvec{ \theta } )&= \prod _{j=1}^{Q} p( \mathbf {f} _{j} | \varvec{ \theta } _{j} ) = \prod _{j=1}^{Q} {\mathcal {N}}( \mathbf {f} _{j}; \mathbf {0} , \mathbf {K}_{ \mathbf {x} \mathbf {x} }^j) \text {,} \end{aligned}$$
(1)
$$\begin{aligned} p(\mathbf {Y} | \mathbf {F}, \varvec{ \phi } )&= \prod _{n=1}^{N} p( \mathbf {y} _{(n)} | \mathbf {f} _{(n)}, \varvec{ \phi } ) \text {,} \end{aligned}$$
(2)

where \(\mathbf {F}\) is the \(N\times Q\) matrix of latent function variables; \( \mathbf {f} _{j}\) is the \(N\)-dimensional vector for latent function j and \(\varvec{ \theta } _j\) are the corresponding hyper-parameters; \( \mathbf {f} _{(n)}\) is the \(Q\)-dimensional vector of latent function values corresponding to observation n; and \(\varvec{ \phi } \) are the likelihood parameters.
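
To make the structure of Eqs. (1)–(2) concrete, the following minimal Python sketch samples \(Q\) independent latent functions from zero-mean gp priors and pushes them through a per-observation likelihood. The rbf kernel, hyper-parameter values and the toy likelihood mapping are illustrative assumptions only, not the specification used in our experiments.

```python
import numpy as np

def rbf_kernel(X, X2, lengthscale=1.0, variance=1.0):
    # kappa(x, x') = variance * exp(-||x - x'||^2 / (2 * lengthscale^2))
    d2 = ((X[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale ** 2)

rng = np.random.default_rng(0)
N, D = 100, 2
X = rng.normal(size=(N, D))

# Eq. (1): each f_j ~ N(0, K_xx^j) independently, with its own theta_j.
lengthscales = (0.5, 1.0, 1.5, 2.0)          # Q = 4 latent functions
F = np.stack([
    rng.multivariate_normal(
        np.zeros(N), rbf_kernel(X, X, ls) + 1e-8 * np.eye(N))
    for ls in lengthscales
], axis=1)                                    # F is N x Q

# Eq. (2): the likelihood factorizes over rows f_(n); here a toy
# Gaussian likelihood that sums the latent values of each row.
y = F.sum(axis=1) + 0.1 * rng.normal(size=N)
```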

Krauth et al. (2017) exploit the structure of the model in Eq. (1) to develop a scalable algorithm via variational inference. While the likelihood in this model is suitable for most unstructured machine-learning problems such as standard regression and classification, the prior can be too restrictive for problems where dependencies across tasks can be incorporated explicitly. In this paper, driven by the solar power prediction problem, where spatial relatedness can be leveraged to improve predictions across different sites, we lift this statistical independence constraint (across latent functions) in the prior to propose a new multi-task model where some of the functions are coupled a priori.

4 Grouped priors for multi-task GP models

To group latent functions a priori, we define arbitrarily chosen subsets of latent functions in \(\mathbf {F}\), \(\mathbf {F}_{r}, r= 1, \ldots , R\), where \(R\) is the total number of groups. For each group, the number of latent functions within it is denoted \(Q_r\), which we also refer to as the group size, with \(\sum _{r=1}^{R} Q_r= Q\). Each group comprises latent functions \(\mathbf {F}_{r}= \{ \mathbf {f} _{j}\}_{j \in \text { group } r} \), and covariance between latent functions \(f_j\) and \(f_{j^\prime }\) is nonzero iff \(f_j\) and \(f_{j^\prime }\) belong to the same group r.

Hence, the prior on \(\mathbf {F}\) can be expressed similarly to the generic prior defined in Eq. (1):

$$\begin{aligned} p(\mathbf {F} | \varvec{ \theta } ) = \prod _{r=1}^{R} p(\mathbf {F}_{r} | \varvec{ \theta } _{r} ) = \prod _{r=1}^{R} {\mathcal {N}}(\mathbf {F}_{r}; \mathbf {0} , \mathbf {K}_{f f}^r), \end{aligned}$$
(3)

where \(\mathbf {K}_{f f}^r\in {\mathbb {R}}^{NQ_r\times NQ_r}\) is the covariance matrix generated by the group kernel function \(\kappa _{r}(f_j( \mathbf {x} ), f_{j'}( \mathbf {x} ^\prime ) )\), which evaluates the covariance of functions \(f_j\) and \(f_{j^\prime }\) at the locations \( \mathbf {x} \) and \( \mathbf {x} ^\prime \), respectively.

This structure allows arbitrary grouping of latent functions depending on the application (in our case, groups are structured for distributed forecasting, discussed below). However, we emphasize that our inference method allows grouping between any latent functions in \(\mathbf {F}\) and does not make any assumptions (beyond the standard iid assumption) on the conditional likelihood. Hence, since our model allows dependencies between latent functions a priori, we refer to it as grouped Gaussian processes (ggp). Although we develop a generic and efficient inference method for ggp models in Sect. 6, our focus in this paper is on a particular class of flexible multi-task regression models referred to in the literature as Gaussian process regression networks (gprn, Wilson et al. 2012).

4.1 Separable kernels

Before describing how gprn models fit into the framework of Krauth et al. (2017) and how we generalize them to incorporate grouped priors, it is important to describe a simple yet efficient way of modeling correlations across groups. Once latent functions are coupled a priori, scalability becomes an important consideration. Thus, although \(\kappa _{r}(f_j( \mathbf {x} ), f_{j'}( \mathbf {x} ^\prime ) )\) is not constrained in terms of kernel choice, for the problem at hand we consider separable kernels of the form \(\kappa _{r}(f_j( \mathbf {x} ), f_{j'}( \mathbf {x} ^\prime ) )= \kappa _{r}( \mathbf {x} , \mathbf {x} ')\kappa _{r}( \mathbf {h} _j, \mathbf {h} _{j'} )\). The \( \mathbf {h} _j\) are \(H\)-dimensional feature vectors forming an additional feature matrix \(\mathbf {H}_r\in {\mathbb {R}}^{Q_r\times H}\) that characterizes covariance across the functions \( \mathbf {f} _{j}\in \mathbf {F}_{r}\). We describe in Sect. 7 below how \(\mathbf {H}_r\) can be used to exploit spatial dependency between tasks.

This separable structure yields covariance matrices of the Kronecker form \( \mathbf {K}_{f f}^r= \mathbf {K}_{ \mathbf {h} \mathbf {h} }^r\otimes \mathbf {K}_{ \mathbf {x} \mathbf {x} }^r\), where \(\mathbf {K}_{ \mathbf {x} \mathbf {x} }^r\in {\mathbb {R}}^{N\times N}\) and \(\mathbf {K}_{ \mathbf {h} \mathbf {h} }^r\in {\mathbb {R}}^{Q_r\times Q_r}\). By adopting the Kronecker-structured prior covariance over functions within a group, we reduce the maximum dimension of required matrix inversions, allowing scalable inference.
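
The computational saving can be seen in a short sketch, assuming rbf kernels for both factors (all sizes and kernel choices here are illustrative): the \(NQ_r \times NQ_r\) group covariance is never formed, since \((\mathbf {A}\otimes \mathbf {B})^{-1} = \mathbf {A}^{-1} \otimes \mathbf {B}^{-1}\).

```python
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(1)
N, Qr, H = 200, 5, 2
X = rng.normal(size=(N, 1))                  # inputs
Hfeat = rng.normal(size=(Qr, H))             # one feature vector h_j per function

Kxx = rbf(X, X) + 1e-6 * np.eye(N)           # N x N factor over inputs
Khh = rbf(Hfeat, Hfeat) + 1e-6 * np.eye(Qr)  # Qr x Qr factor over functions

# The full covariance K_ff^r = np.kron(Khh, Kxx) is (N*Qr) x (N*Qr); instead
# invert only the factors: (Khh kron Kxx)^{-1} = Khh^{-1} kron Kxx^{-1}.
Khh_inv = np.linalg.inv(Khh)
Kxx_inv = np.linalg.inv(Kxx)
```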

5 Grouped Gaussian process regression networks

Wilson et al. (2012) consider the case where the (noiseless) observations are a linear combination of Gaussian processes, \(\{ g_{\ell }( \mathbf {x} ) \}\), where the coefficients, \(\{ w_{p \ell }( \mathbf {x} ) \}\), are also input-dependent and drawn from Gaussian process priors. In other words, their conditional likelihood model for a single observation at input point \( \mathbf {x} \) and task p is given by:

$$\begin{aligned} y_p( \mathbf {x} ) = \sum _{\ell =1}^{Q_g} w_{p \ell }( \mathbf {x} ) g_\ell ( \mathbf {x} ) + \epsilon _p \text {, } p=1, \ldots , P\text {,} \end{aligned}$$
(4)

where \(\{w_{p \ell }( \mathbf {x} ), g_\ell ( \mathbf {x} ) \}\) are drawn from independent gp priors and \(\epsilon _p\) is a task-dependent Gaussian noise variable. This model is termed the Gaussian process regression network (GPRN) by Wilson et al. (2012) and \(\{w_{p \ell }\}\) and \(\{g_{\ell }\}\) are referred to as weight functions and node functions, respectively. It is easy to see how gprn s fit into the latent Gaussian process model formulation of Krauth et al. (2017), as described in Sect. 3.1. We simply make \(\{w_{p\ell }\}\) and \(\{g_\ell \}\) subsets of latent functions in \(\{f_j\}_{j=1}^Q\) with \(PQ_g\) weight functions and \(Q_g\) node functions so that \(Q_g(P+1) = Q\).

Given the observed data \({\mathcal {D}}\), for each latent process (over weights or node functions) we need to create as many latent variables as observations. Therefore, it is useful to conceptualize the weights as \(PQ_g\) latent variables of dimension \(N\times 1\) arranged into a tensor \(\mathbf {W}\in {\mathbb {R}}^{P\times Q_g\times N}\). Similarly, the node functions can be represented by \(Q_g\) latent variables of dimension \(N\times 1\) arranged into a tensor \(\mathbf {G}\in {\mathbb {R}}^{Q_g\times 1 \times N}\). Therefore, the conditional likelihood for input \( \mathbf {x} _{(n)}\) can be written in matrix form as

$$\begin{aligned} p( \mathbf {y} _{(n)} | \mathbf {f} _{(n)}, \varvec{ \phi } ) = {\mathcal {N}}( \mathbf {y} _{(n)}; \mathbf {W}_{(n)} \mathbf {g} _{(n)}, \varvec{\Sigma }_y) \text {,} \end{aligned}$$
(5)

where the latent functions are given by node and weight functions, i.e. \( \mathbf {f} _{(n)}= \{\mathbf {W}_{(n)}, \mathbf {g} _{(n)}\}\); the conditional likelihood parameters are \(\varvec{ \phi } = \varvec{\Sigma }_y\), where \(\varvec{\Sigma }_y\) is a diagonal matrix. \(P\)-dimensional outputs are constructed at \( \mathbf {x} _{(n)}\) as the product of a \(P\times Q_g\) matrix of weight functions, \(\mathbf {W}_{(n)}\), and the \(Q_g\)-dimensional vector of node functions \( \mathbf {g} _{(n)}\).
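
A one-observation sketch of the gprn likelihood in Eq. (5), with illustrative shapes and parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
P, Qg = 10, 3                        # tasks and node functions
W_n = rng.normal(size=(P, Qg))       # weight function values at x_(n)
g_n = rng.normal(size=Qg)            # node function values at x_(n)
noise_sd = 0.1 * np.ones(P)          # square root of diag(Sigma_y)

# Eq. (5): y_(n) ~ N(W_(n) g_(n), Sigma_y), with diagonal Sigma_y.
y_n = W_n @ g_n + noise_sd * rng.normal(size=P)
```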

5.1 Grouping structure

Although our modeling and inference framework allows for arbitrary grouping structure, we consider a correlated prior over the rows of the weight functions for the grouped gprn, and give details of the exact setting for the solar and wind applications in Sect. 7. Figure 1 illustrates our ggp framework for the gprn likelihood.

Fig. 1 Gaussian process regression network model where \(\mathbf {Y}\) is a linear combination of node and weight latent functions comprising \(\mathbf {F}\). In the grouped Gaussian process (ggp) framework, latent functions may be grouped arbitrarily. A grouping scheme is illustrated where weight functions in \(\mathbf {W}\) are grouped by rows (grouped functions are shown in the same shade) and given a fully-coupled prior, while node functions in \(\mathbf {G}\) are independent. Here \(N\) is the number of observations per task; \(P\) is the number of tasks; and \(Q_g\) is the group size

Naturally, the greater flexibility of our approach comes at the expense of higher time and memory complexity, which poses significant challenges for posterior estimation. In the following section, we develop an efficient variational inference algorithm for our ggp model that is not much more computationally expensive than the original gprn's. In fact, we show in our experiments in Sect. 9 that our inference method can converge faster than the gprn's while achieving similar or better performance.

6 Inference

Our inference method is based on the generic inference approach for latent variable Gaussian process models set out by Krauth et al. (2017). This is a sparse variational method that considers the case where latent functions are conditionally independent. We adapt that framework to the more general case where latent functions covary within groups and, for our case, exploit the Kronecker structures noted in Sect. 4.1. Since our inference method does not exploit any of the specifics of the gprn likelihood, we consistently use the general grouped prior notation defined in Sect. 4.

Under the sparse method, the prior in Eq. (3) is augmented with inducing variables, \(\{ \mathbf {u} _{ r} \}_{r= 1}^{R}\), drawn from the same gp priors as \(\mathbf {F}_{r}\) at new inducing points \(\mathbf {Z}_r\), where \(\mathbf {Z}_r\in {\mathbb {R}}^{M\times D}\) lie in the same space as \(\mathbf {X}\in {\mathbb {R}}^{N\times D}\), with \(M\ll N\). Since the \( \mathbf {u} _{ r} \) are drawn from the same gp priors, inducing variables within a group \(r\) are similarly coupled via \(\kappa _{r}(f_j( \mathbf {x} ), f_{j'}( \mathbf {x} ^\prime ) )\) evaluated at the points \(\mathbf {Z}_r\). The prior in Eq. (3) is thus replaced by

$$\begin{aligned} p( \mathbf {u} | \varvec{ \theta } )&= \prod _{r=1}^{R} p( \mathbf {u} _{ r} | \varvec{ \theta } _{r} ) = \prod _{r=1}^{R} {\mathcal {N}}( \mathbf {u} _{ r} ; \mathbf {0} , \mathbf {K}_{u u}^r) \text {,} \end{aligned}$$
(6)
$$\begin{aligned} p(\mathbf {F} | \mathbf {u} )&= \prod _{r=1}^{R} {\mathcal {N}}(\mathbf {F}_{r}; \tilde{\varvec{ \mu } }_r, {\widetilde{\mathbf {K}}}_r), \end{aligned}$$
(7)

where \(\tilde{\varvec{ \mu } }_r= \mathbf {A}_r \mathbf {u} _{ r} \), \({\widetilde{\mathbf {K}}}_r= \mathbf {K}_{f f}^r- \mathbf {A}_r\mathbf {K}_{u f}^r\) and \(\mathbf {A}_r= \mathbf {K}_{f u}^r(\mathbf {K}_{u u}^r)^{-1}= \mathbf {I}_{Q_r} \otimes \mathbf {K}_{ \mathbf {x} \mathbf {z} }^r(\mathbf {K}_{ \mathbf {z} \mathbf {z} }^r)^{-1}\). Here \(\mathbf {K}_{u u}^r\in {\mathbb {R}}^{MQ_r\times MQ_r}\) is the covariance matrix induced by \(\kappa _{r}(f_j( \mathbf {x} ), f_{j'}( \mathbf {x} ^\prime ) )\) evaluated over \(\mathbf {Z}_r\) and \(\mathbf {H}_r\), yielding the structure \(\mathbf {K}_{u u}^r= \mathbf {K}_{ \mathbf {h} \mathbf {h} }^r\otimes \mathbf {K}_{ \mathbf {z} \mathbf {z} }^r\) and, importantly, the decomposition \((\mathbf {K}_{u u}^r)^{-1}= (\mathbf {K}_{ \mathbf {h} \mathbf {h} }^r)^{-1}\otimes (\mathbf {K}_{ \mathbf {z} \mathbf {z} }^r)^{-1}\). We similarly define \(\mathbf {K}_{f u}^r\) and \(\mathbf {K}_{u f}^r\) (Table 1).
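
The following sketch computes the conditional prior mean \(\tilde{\varvec{ \mu } }_r= \mathbf {A}_r \mathbf {u} _{ r} \) without ever materializing \(\mathbf {A}_r\) as an \(NQ_r\times MQ_r\) matrix, using the Kronecker identity above; sizes and the rbf kernel are illustrative assumptions.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(3)
N, M, Qr = 500, 50, 4
X = rng.normal(size=(N, 1))
Z = rng.normal(size=(M, 1))                 # inducing inputs Z_r

Kxz = rbf(X, Z)
Kzz = rbf(Z, Z) + 1e-6 * np.eye(M)
A_block = Kxz @ np.linalg.inv(Kzz)          # the N x M block of A_r

# Inducing variables, one length-M row per latent function in the group.
u_r = rng.normal(size=(Qr, M))

# mu_r = (I_{Qr} kron A_block) vec(u_r), applied block-wise per function.
mu_r = u_r @ A_block.T                      # Qr x N
```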

Table 1 A summary of the prior covariance matrix structures for a given group r for scalable variational inference in the ggp model

6.1 Posterior estimation

The (analytically intractable) joint posterior distribution of the latent functions and inducing variables under the likelihood in Eq. (2) and the augmented prior in Eqs. (6)–(7) is approximated via variational inference (Jordan et al. 1998). Specifically, \( p(\mathbf {F}, \mathbf {u} | \mathbf {Y} ) = p(\mathbf {F} | \mathbf {u} , \mathbf {Y} )p( \mathbf {u} | \mathbf {Y} ) \approx q(\mathbf {F}, \mathbf {u} | \varvec{\lambda }){\mathop {=}\limits ^{\text {\tiny def}}}p(\mathbf {F} | \mathbf {u} )q( \mathbf {u} | \varvec{\lambda }). \) The variational posterior \(q( \mathbf {u} | \varvec{\lambda })\) is defined as a mixture of \(K\) Gaussians (mog) with mixture proportions \(\pi _k\). We assume that \(q( \mathbf {u} | \varvec{\lambda })\) also factorizes over groups (and, in the diagonal case, over individual latent functions). The variational posterior is thus defined as

$$\begin{aligned} q( \mathbf {u} | \varvec{\lambda })= \sum _{k=1}^{K} \pi _k\prod _{r=1}^{R} q_k( \mathbf {u} _{ r} | \varvec{\lambda }_{k r}) \end{aligned}$$
(8)

where \(q_k( \mathbf {u} _{ r} | \varvec{\lambda }_{k r})= {\mathcal {N}}( \mathbf {u} _{ r} ; \mathbf {m} _{kr}, \mathbf {S}_{kr})\) and \(\varvec{\lambda }_{k r}= \{ \mathbf {m} _{kr}, \mathbf {S}_{kr}, \pi _k\}\). We then estimate the model by maximizing the so-called evidence lower bound (elbo), which decomposes as \( {\mathcal {L}}_{\text {elbo}}(\varvec{\lambda }) {\mathop {=}\limits ^{\text {\tiny def}}}{\mathcal {L}}_{\text {ent}}(\varvec{\lambda }) + {\mathcal {L}}_{\text {cross}}(\varvec{\lambda }) + {\mathcal {L}}_{\text {ell}}(\varvec{\lambda }) \text {,} \) where the three terms are the entropy, cross-entropy and expected log likelihood, respectively. The explicit expression required for \({\mathcal {L}}_{\text {elbo}}\) is a generalization of the results in Krauth et al. (2017). For the entropy term we have (using Jensen's inequality):

$$\begin{aligned} {\mathcal {L}}_{\text{ ent }}(\varvec{\lambda })&= -{\mathbb {E}}_{q( \mathbf {u} | \varvec{\lambda })}{\left[ \log q( \mathbf {u} | \varvec{\lambda })\right] } \nonumber \\ {}&\ge - \sum _{k=1}^{K} \pi _k \log \sum _{l=1}^{K} \pi _l {\mathcal {N}}( \mathbf {m} _{k}; \mathbf {m} _{l}, \mathbf {S}_{k} + \mathbf {S}_{l}), \end{aligned}$$
(9)

where \( \mathbf {m} _{k}\) is the vector \( \{ \mathbf {m} _{kr} \}_{r=1}^{R} = \{ \mathbf {m} _{kj} \}_{j=1}^{Q}\) and \(\mathbf {S}_{k}\) is the block diagonal matrix with diagonal elements \( \{ \mathbf {S}_{kr} \}_{r=1}^{R}\) (and equivalent for \( \mathbf {m} _{l}\), \(\mathbf {S}_{l}\)). For the cross-entropy and the expected log likelihood terms:

$$\begin{aligned} {\mathcal {L}}_{\text{ cross }}(\varvec{\lambda })&= {\mathbb {E}}_{q( \mathbf {u} | \varvec{\lambda })}{\left[ \log p( \mathbf {u} | \varvec{ \theta } )\right] } \nonumber \\&= -\frac{1}{2} \sum _{k=1}^{K} \pi _k\sum _{r=1}^{R} \left[ M_{r} \log (2\pi ) + \log \left| \mathbf {K}_{u u}^r\right| + \mathbf {m} _{kr}'(\mathbf {K}_{u u}^r)^{-1} \mathbf {m} _{kr} + \text {tr}\left( (\mathbf {K}_{u u}^r)^{-1} \mathbf {S}_{kr}\right) \right] \end{aligned}$$
(10)
$$\begin{aligned} {\mathcal {L}}_{\text {ell}}(\varvec{\lambda })&= {\mathbb {E}}_{q(\mathbf {F}| \varvec{\lambda })}{\left[ \log p(\mathbf {Y} | \mathbf {F}, \varvec{ \phi } )\right] } = \sum _{n=1}^{N} {\mathbb {E}}_{q_{(n)}( \mathbf {f} _{(n)}| \varvec{\lambda })}{\left[ \log p( \mathbf {y} _{(n)} | \mathbf {f} _{(n)}, \varvec{ \phi } )\right] } \end{aligned}$$
(11)

where \(q(\mathbf {F}| \varvec{\lambda })\) results from integration of the joint approximate posterior over inducing variables \( \mathbf {u} \). Note that \( \text{ tr } ((\mathbf {K}_{u u}^r)^{-1})\) factorizes as \( \text{ tr } ((\mathbf {K}_{ \mathbf {z} \mathbf {z} }^r)^{-1}) \text{ tr } ((\mathbf {K}_{ \mathbf {h} \mathbf {h} }^r)^{-1})\) and \(\ln \left|\mathbf {K}_{u u}^r\right|\) factorizes as \(Q_r\ln \left|\mathbf {K}_{ \mathbf {z} \mathbf {z} }^r\right| + M\ln \left|\mathbf {K}_{ \mathbf {h} \mathbf {h} }^r\right|\). Given factorization of the joint and variational posteriors over k and \(r\) and standard conjugacy results, we have

$$\begin{aligned} q(\mathbf {F}| \varvec{\lambda })&= \sum _{k=1}^{K} \pi _k\prod _{r=1}^{R} {\mathcal {N}}( \mathbf {b} _{kr}, \varvec{\Sigma }_{kr} ), \nonumber \\ \mathbf {b} _{kr}&= \mathbf {A}_r \mathbf {m} _{kr}, \quad \text {and} \quad \varvec{\Sigma }_{kr} = {\widetilde{\mathbf {K}}}_r+ \mathbf {A}_r\mathbf {S}_{kr}\mathbf {A}_r' \end{aligned}$$
(12)

The distribution \(q_{(n)}( \mathbf {f} _{(n)}| \varvec{\lambda })\) similarly factorizes as

$$\begin{aligned} q_{(n)}( \mathbf {f} _{(n)}| \varvec{\lambda })&= \sum _{k=1}^{K} \pi _kq_{k(n)}( \mathbf {f} _{(n)}| \varvec{\lambda }_k )\nonumber \\&= \sum _{k=1}^{K} \pi _k\prod _{r=1}^{R} {\mathcal {N}}( \mathbf {b} _{kr(n)}, \varvec{\Sigma }_{kr (n)} ) \text {.} \end{aligned}$$
(13)

\({\mathcal {L}}_{\text {ell}}\) may be estimated by Monte Carlo, requiring sampling only from \(Q_r\)-dimensional multivariate Gaussians \({\mathcal {N}}( \mathbf {f} _{r(n)}; \mathbf {b} _{kr(n)}, \varvec{\Sigma }_{kr (n)} )\), where \( \mathbf {b} _{kr(n)}\) is the vector formed by taking the nth element of each of the \(Q_r\) function blocks of \( \mathbf {b} _{kr}\), and \(\varvec{\Sigma }_{kr (n)}\) is the (full) \(Q_r\times Q_r\) matrix formed from the nth diagonal elements of the posterior covariance sub-matrices \(\varvec{\Sigma }_{k jj'}\) of \(\varvec{\Sigma }_{kr}\). Thus, we estimate

$$\begin{aligned} \widehat{{\mathcal {L}}}_{\text {ell}}= \frac{1}{S} \sum _{n=1}^{N} \sum _{k=1}^{K} \pi _k\sum _{s=1}^{S} \ln p( \mathbf {y} _{(n)} | \mathbf {f} _{(n)}^{(k,s)} ) \text {.} \end{aligned}$$
(14)
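
A minimal sketch of the estimator in Eq. (14) for a single observation and mixture component follows, assuming a toy Gaussian likelihood that simply sums the \(Q_r\) sampled latent values (the actual gprn likelihood instead mixes node values through weights as in Eq. (5)); all values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
S, Qr = 20, 4
b_n = rng.normal(size=Qr)                          # b_{kr(n)}
L = np.tril(rng.normal(size=(Qr, Qr)))
Sigma_n = L @ L.T + np.eye(Qr)                     # full Sigma_{kr(n)}
y_n, noise_var = 0.3, 0.01

# Draw f_(n)^(k,s) from the Qr-dimensional marginal of Eq. (13).
samples = rng.multivariate_normal(b_n, Sigma_n, size=S)

# Average the log-likelihood over samples, as in Eq. (14).
log_liks = -0.5 * (np.log(2 * np.pi * noise_var)
                   + (y_n - samples.sum(axis=1)) ** 2 / noise_var)
ell_n = log_liks.mean()
```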

Under the separable structure adopted, each mixture component covariance for \( \mathbf {f} _{r(n)}\), \(\varvec{\Sigma }_{kr (n)}\) can be seen to consist of structure arising from the grouped prior plus a term arising from the variational posterior:

$$\begin{aligned} \varvec{\Sigma }_{kr (n)}&= {\widetilde{\mathbf {K}}}_{r(n)}+ \mathbf {A}_{r(n)}\mathbf {S}_{kr}\mathbf {A}_{r(n)}^{\prime }, \quad \text {where} \nonumber \\ {\widetilde{\mathbf {K}}}_{r(n)}&= \mathbf {K}_{ \mathbf {h} \mathbf {h} }^r\times \left[ \kappa _{r}( \mathbf {x} _{(n)}, \mathbf {x} _{(n)}) - \kappa _{r}( \mathbf {x} _{(n)},\mathbf {Z}_r)(\mathbf {K}_{ \mathbf {z} \mathbf {z} }^r)^{-1}\kappa _{r}(\mathbf {Z}_r, \mathbf {x} _{(n)})\right] \quad \text {and} \nonumber \\ \mathbf {A}_{r(n)}&= \left[ \mathbf {I}_{Q_r} \otimes \kappa _{r}( \mathbf {x} _{(n)},\mathbf {Z}_r)(\mathbf {K}_{ \mathbf {z} \mathbf {z} }^r)^{-1}\right] \end{aligned}$$
(15)

Thus cross-function covariance within a group will be either driven by the prior, where \(\mathbf {S}_{kr}\) is diagonal, or more flexible in form where \(\mathbf {S}_{kr}\) is non-diagonal.
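
The trace and log-determinant factorizations used in \({\mathcal {L}}_{\text {cross}}\) above are standard Kronecker identities; a quick numerical check, with illustrative matrix sizes:

```python
import numpy as np

rng = np.random.default_rng(5)
M, Qr = 8, 3
A = rng.normal(size=(M, M));  Kzz = A @ A.T + M * np.eye(M)
B = rng.normal(size=(Qr, Qr)); Khh = B @ B.T + Qr * np.eye(Qr)
Kuu = np.kron(Khh, Kzz)

# tr(Kuu^{-1}) = tr(Khh^{-1}) tr(Kzz^{-1})
assert np.isclose(np.trace(np.linalg.inv(Kuu)),
                  np.trace(np.linalg.inv(Khh)) * np.trace(np.linalg.inv(Kzz)))

# ln|Kuu| = Qr ln|Kzz| + M ln|Khh|
_, ld_u = np.linalg.slogdet(Kuu)
_, ld_z = np.linalg.slogdet(Kzz)
_, ld_h = np.linalg.slogdet(Khh)
assert np.isclose(ld_u, Qr * ld_z + M * ld_h)
```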

6.2 Prediction

Prediction for a new point \( \mathbf {y} _{\star }\) given \( \mathbf {x} _{\star }\) is taken as the expectation over the general posterior distribution for the new point:

$$\begin{aligned} p( \mathbf {y} _{\star } | \mathbf {x} _{\star } )&=\int p( \mathbf {y} _{\star } | \mathbf {f} _{\star } )q( \mathbf {f} _{\star }| \varvec{\lambda })\text {d} \mathbf {f} _{\star }\nonumber \\&= \sum _{k=1}^{K} \pi _k\int p( \mathbf {y} _{\star } | \mathbf {f} _{\star } )q_k( \mathbf {f} _{\star }| \varvec{\lambda }_k)\text {d} \mathbf {f} _{\star }, \end{aligned}$$
(16)

where \(q_k( \mathbf {f} _{\star }| \varvec{\lambda }_k)\) is defined as for \(q_{k(n)}( \mathbf {f} _{(n)}| \varvec{\lambda }_k )\) in (13). Given the explicit expression for the posterior distribution, the expectation in Eq. (16) is estimated by sampling:

$$\begin{aligned} {\mathbb {E}}_{p( \mathbf {y} _{\star } | \mathbf {x} _{\star } )}{\left[ \mathbf {y} _{\star }\right] }&\approx \frac{1}{S} \sum _{s=1}^{S} \mathbf {W}_{\star }^{s} \mathbf {g} _{\star }^{s}, \end{aligned}$$
(17)

where \(\{ \mathbf {W}_{\star }^{s}, \mathbf {g} _{\star }^{s} \} = \mathbf {f} _{\star }^{s}\) are samples from \(q_k( \mathbf {f} _{\star }| \varvec{\lambda }_k)\).
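
A sketch of the sampled predictive mean in Eq. (17), with illustrative posterior draws standing in for samples of \(\mathbf {W}_{\star }\) and \( \mathbf {g} _{\star }\):

```python
import numpy as np

rng = np.random.default_rng(6)
S, P, Qg = 200, 10, 3

# Stand-ins for S draws of W_star (P x Qg) and g_star (Qg) from q_k.
W_samples = rng.normal(loc=0.5, scale=0.1, size=(S, P, Qg))
g_samples = rng.normal(loc=1.0, scale=0.2, size=(S, Qg))

# Eq. (17): average W_star^s g_star^s over samples.
y_star_mean = np.einsum('spq,sq->sp', W_samples, g_samples).mean(axis=0)
```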

6.3 Complexity

Under the ggp with a Kronecker-structured prior, the time complexity per iteration changes slightly from the independent function case. For the same \(P, Q_g\) and \(M\), fewer \(M\)-dimensional inversions are required for ggp versus gprn, without any increase in maximum dimension under the Kronecker specification, assuming \(M\ge Q_r\). This represents a substantial reduction in \(M\)-dimensional inversions, depending on the grouping scheme.

The cost of calculating \({\mathcal {L}}_{\text {cross}}\) is dominated by the cost of inversions, being \({\mathcal {O}}\left( \sum _{r=1}^{R} (M^{3}+Q_r^{3}) \right) \) for the grouped case and \({\mathcal {O}}\left( QM^{3} \right) \) for the independent case. Under the diagonal posterior specification, \(\mathbf {S}_{k}\) in Eq. (9) reduces to the same form as the independent case of Krauth et al. (2017). Lastly, \({\mathcal {L}}_{\text {ell}}\) under the grouped structure requires sampling from low-dimensional \(Q_r\times Q_r\) multivariate Gaussians with non-diagonal posterior covariance matrices, whereas this is avoided under the independent framework. However, the low dimensionality (number of tasks in our empirical evaluation) involved yields minimal additional cost.
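
As a rough worked example, consider the solar grouping of Sect. 7 with \(P = 10\) tasks, so that \(Q = Q_g(P+1) = 110\) and \(R = 2P = 20\) (\(P\) row groups of size \(P\) plus \(P\) singleton node groups). Counting only the dominant inversion costs above (constants and the actual \(M\) per model differ; this is illustrative only):

```python
P, M = 10, 200
Q = P * (P + 1)                                   # 110 independent functions

# Grouped: sum_r (M^3 + Q_r^3) over P row groups and P singleton groups.
grouped = P * (M ** 3 + P ** 3) + P * (M ** 3 + 1)
independent = Q * M ** 3                          # O(Q M^3) for gprn
print(grouped / independent)                      # ~0.18
```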

7 Grouped Gaussian processes for spatially dependent tasks

It is natural to consider a multi-task framework in a spatio-temporal setting such as distributed solar forecasting, where power output at solar sites in a region would a priori be expected to covary over time and space. Given the expectation of spatially-driven covariance across sites, i.e. tasks, we seek to exploit this structure to increase both efficiency and accuracy of multi-task forecasts. Our approach does this by incorporating explicit spatial dependencies between latent functions in the model.

Latent functions in the general framework do not necessarily map to a particular task. The question therefore arises as to how to use spatial information relating to tasks to structure covariance between latent functions. We solve this by setting \(Q_g= P\) and grouping latent functions within rows of \(\mathbf {W}\), i.e. \( \mathbf {f} _{j}\in \mathbf {W}_{i:}, i=1,\dots ,P\). We then define a feature matrix \(\mathbf {H}_r\) that governs covariance across the \(P\) functions in each row (Fig. 1). With \(Q_g\ge P\) it is possible to obtain a very general representation of the multi-task process with full mixing between tasks via \(\mathbf {G}\), which now contains \(P\) node functions. This grouping structure allows parameters to vary across tasks while, at the same time, the coupled prior can act to regularize latent function predictions.

Model settings In our setting, we consider each latent process in \(\mathbf {G}\) to be an independent gp, i.e., \(\left\langle \mathbf {g} _j, \mathbf {g} _{j'} \right\rangle = 0\) for \(j \ne j^\prime \). Furthermore, the input features of \( \mathbf {g} _j, j= 1,\dots ,P\), are defined to be task features, i.e. features for \( \mathbf {g} _j\) relate to task j; specifically, lagged target values for task j.

We define spatial features \( \mathbf {h} _j = (latitude_{j}, longitude_{j})\) governing weightings applied to node functions. For a given task i, \( {\mathbb {E}}_{p(\mathbf {Y} | \mathbf {F}, \varvec{ \phi } )}{\left[ \mathbf {y} _{(n) i}\right] } = \mathbf {w} _{(n)i} \mathbf {g} _{(n)}\) where \( \mathbf {w} _{(n)i}\) denotes the ith row vector of \(\mathbf {W}_{(n)}\). It can be seen that, in addition to depending on input features \( \mathbf {x} _{(n)}\), relative weights placed on node functions are now smoothed by spatial covariance imposed over the weights in \( \mathbf {w} _{(n)i}\). This allows site-by-site optimization of spatial decay in (cross-task) weights in addition to site-specific parameterization and features in \( \mathbf {w} _{(n)i}\). In total, this grouping structure yields \(2P\) groups: \(P\) groups of size \(P\) (corresponding to \(\mathbf {W}\)) and \(P\) groups of size 1 (corresponding to \(\mathbf {G}\)).
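
A sketch of this grouping as index sets, assuming the latent functions are ordered with the \(P\times P\) weight matrix \(\mathbf {W}\) flattened row-wise followed by the \(P\) node functions (the ordering convention is an illustrative assumption):

```python
P = 4
Q = P * (P + 1)

weight_idx = lambda i, j: i * P + j       # index of W[i, j] among the f_j
node_idx = lambda j: P * P + j            # index of node function g_j

groups = [[weight_idx(i, j) for j in range(P)] for i in range(P)]  # P row groups
groups += [[node_idx(j)] for j in range(P)]                        # P singletons
assert sum(len(g) for g in groups) == Q   # 2P groups covering all functions
```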

Kernels and features for \(\kappa _{r}( \mathbf {x} , \mathbf {x} ')\) and \(\kappa _{r}( \mathbf {h} _j, \mathbf {h} _{j'} )\) are selected in line with previous studies relating to multi-task distributed forecasting (Inman et al. 2013; Dahl and Bonilla 2017). In particular, for our task of forecasting distributed solar output over time, for \( \mathbf {g} _j, j= 1,\dots ,P\), we define \(\kappa _{ \mathbf {g} _j} ( \mathbf {x} _t, \mathbf {x} _s)= \kappa _{ \mathbf {g} _j} ({\varvec{l}}_t,{\varvec{l}}_s)\) as a radial basis function kernel (\(\kappa _{RBF}\)) applied to a feature vector of recent lagged observed power at site j, i.e. for site j at time t, \({\varvec{l}}_{j,t} = (y_{j,t}, y_{j,t-1}, y_{j,t-2})\).

For row-group r, we define a separable, multiplicative kernel structure as discussed above, i.e. \(\kappa _{r}(f_j( \mathbf {x} ), f_{j'}( \mathbf {x} ^\prime ) )= \kappa _{r}( \mathbf {x} , \mathbf {x} ')\kappa _{r}( \mathbf {h} _j, \mathbf {h} _{j'} )\). We set the kernel over the inputs as \(\kappa _{r}( \mathbf {x} , \mathbf {x} ')= \kappa _{Per.}(t,s)\kappa _{RBF}({\varvec{l}}_{rt},{\varvec{l}}_{rs})\), where \( \kappa _{Per.}(t,s)\) is a periodic kernel on a time index t capturing diurnal cyclical trends in solar output.

We adopt a compact kernel over functions, specifically a separable rbf–Epanechnikov structure, i.e., \(\kappa _{r}( \mathbf {h} _j, \mathbf {h} _{j'} )= \kappa _{RBF}( \mathbf {h} _j, \mathbf {h} _{j^\prime }) \kappa _{Ep.}( \mathbf {h} _j, \mathbf {h} _{j^\prime })\), \( j,j^\prime =1 \ldots P\), where \( \mathbf {h} _j = (latitude, longitude)\) for site j. By using a more flexible compact kernel, we aim to allow beneficial shared learning across tasks while reducing negative transfer by allowing cross-function weights to decay to zero at an optimal distance.
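
A sketch of this cross-function kernel over site coordinates, using one common form of the Epanechnikov factor, \((1 - (d/c)^2)_+\), truncated at a cut-off distance \(c\); coordinates and parameter values are illustrative.

```python
import numpy as np

def rbf_epanechnikov(H, lengthscale=0.5, cutoff=1.0):
    # kappa_r(h_j, h_j') = kappa_RBF * kappa_Ep over (latitude, longitude).
    d2 = ((H[:, None, :] - H[None, :, :]) ** 2).sum(-1)
    k_rbf = np.exp(-0.5 * d2 / lengthscale ** 2)
    k_ep = np.clip(1.0 - d2 / cutoff ** 2, 0.0, None)   # compact support
    return k_rbf * k_ep

sites = np.array([[-34.90, 138.60],      # hypothetical (lat, lon) pairs
                  [-34.95, 138.55],
                  [-34.80, 138.70]])
Khh = rbf_epanechnikov(sites)            # zero beyond the cut-off distance
```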

8 Experiments

We evaluate our approach on forecasting problems for distributed residential solar installations and wind speed measured at proximate weather stations.

8.1 Solar forecasting

The task for solar is to forecast power production 15 min ahead at multiple distributed sites in a region. Data consist of 5-min average power readings from groups of proximate sites in the Adelaide and Sydney regions of Australia. We present results for three datasets: ten Adelaide sites (adel-autm) and twelve Sydney sites (syd-autm), both over 60 days during autumn 2016, and ten Adelaide sites (adel-summ) over 60 days in spring–summer 2016. We train all models on 36 days of data and test forecast accuracy on the 24 subsequent days (days are defined as 7 am to 7 pm). In all, for each site, we have 5000 datapoints for training and 3636 datapoints for testing.

Datasets have varying spatial dispersions. adel-autm (adel-summ) sites are spread over an approximately \(30 \times 40\;(20 \times 20)\) km area, while syd-autm sites are evenly dispersed over an approximately \(15 \times 20\) km area.

8.1.1 Benchmark models

We compare forecast performance of our ggp method to the fully independent gprn and several other benchmark models. We estimate (1) separate independent gp forecast models for each site (igp), (2) pooled multi-task models with task-specific (spatial) features for sites (mtg), and (3) multi-task linear coregional models (lcm). The final benchmark model (4) is the gprn with independent latent functions (Wilson et al. 2012).

These models can be expressed in terms of the general latent function framework with differing values of \(P\), \(Q\) and \(R\), and different likelihood functions. As discussed in Sect. 5, where latent functions are independent, group size is equal to 1 and \(R= Q\). Key model constants are presented in Table 3.

Both igp and mtg models have a standard, single-task Gaussian likelihood function, while the lcm model comprises \(P\) node functions mapped to outputs via a \(P\times Q_g\) matrix of deterministic weights, i.e. \(p( \mathbf {y} _{(n)} | \mathbf {f} _{(n)}, \varvec{ \phi } ) = {\mathcal {N}}( \mathbf {y} _{(n)}; \mathbf {W}_{(n)} \mathbf {g} _{(n)}, \varvec{\Sigma }_y)\) where \(\mathbf {W}_{(n)ij}=w_{ij} \quad \forall n=1,\dots , N\) and \(Q_g=P\). Kernels for all models are presented in Table 2. We maintain a similar kernel specification across models. All kernels are based on the specification described in Sect. 7.

Table 2 Latent function kernel specifications for ggp and benchmark models

Models are presented for diagonal and full mog posterior specifications, with \(K=1\). In the case of the ggp, to maintain the scalable specification, we adopt a Kronecker construction of the full posterior for each group \(r\), in line with the prior specification.

To compare model performance under equivalent settings, we consider the complexity of the different approaches and standardize model settings by reference to a consistent target computational complexity per iteration. In our variational framework, the time complexity is dominated by algebraic operations with cubic complexity in the number of inducing inputs \(M\). We therefore set \(QM^3 = RM^3 = 20 \times (200)^3\) for the adel-autm and adel-summ models and \(QM^3 = RM^3 = 24 \times (200)^3\) for syd-autm, and adjust the number of inducing points, \(M\), accordingly (Table 3).
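
Back-solving \(M\) from this budget is straightforward; for example, under the Adelaide budget of \(20 \times (200)^3\), a model with \(R = 20\) groups keeps \(M = 200\), whereas the gprn with \(Q = 110\) independent functions gets a smaller \(M\) (a sketch; per-model values are in Table 3):

```python
budget = 20 * 200 ** 3
for q in (20, 110):                       # ggp groups R; gprn functions Q
    M = round((budget / q) ** (1 / 3))
    print(q, M)                           # 20 -> 200, 110 -> 113
```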

Table 3 Key constants for ggp and benchmark models

8.1.2 Experiment settings and performance measures

All models are estimated using the variational framework explained in Sect. 6. We optimize the elbo iteratively until its relative change over successive epochs is less than \(10^{-5}\), up to a maximum of 200 epochs. Optimization is performed using adam (Kingma and Ba 2014) with settings \(\{LR = 0.005; \beta _{1}=0.09; \beta _{2}=0.99 \}\). All data except time index features are normalized prior to optimization. Reported forecast accuracy measures are root mean squared error (rmse) and negative log predictive density (nlpd). The non-Gaussian likelihood of gprn models makes the usual analytical expression for nlpd intractable. We therefore estimate it using Monte Carlo:

$$\begin{aligned} \text {{nlpd}} =- {\mathbb {E}}_{q( \mathbf {f} _{\star }| \varvec{\lambda })}{\left[ \ln p( \mathbf {y} _{\star } | \mathbf {f} _{\star } )\right] } \approx - \frac{1}{S} \sum _{s=1}^{S} \ln {\mathcal {N}}( \mathbf {y} _{\star }; \mathbf {W}_{\star }^{s} \mathbf {g} _{\star }^{s}, \varvec{ \phi } ) \text {,} \end{aligned}$$

where \(\mathbf {W}_{\star }^{s}\) and \( \mathbf {g} _{\star }^{s}\) are draws from the corresponding posterior over \( \mathbf {f} _{\star }\). In addition, we compute the average ranking (m-rank) over both accuracy measures (rmse and nlpd), and the mean forecast variance (f-var), which is critical to the use of short term forecasts as inputs to system or market management algorithms.
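
A sketch of this Monte Carlo nlpd estimate, with illustrative stand-ins for the sampled means \(\mathbf {W}_{\star }^{s} \mathbf {g} _{\star }^{s}\) and for the diagonal noise variance taken from \(\varvec{ \phi } \):

```python
import numpy as np

rng = np.random.default_rng(7)
S, P = 100, 10
y_star = rng.normal(size=P)               # observed outputs at x_star
means = rng.normal(size=(S, P))           # W_star^s g_star^s draws
noise_var = 0.05

# Per-sample Gaussian log-density of y_star, summed over the P tasks.
log_dens = -0.5 * (np.log(2 * np.pi * noise_var)
                   + (y_star - means) ** 2 / noise_var).sum(axis=1)
nlpd = -log_dens.mean()
```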

Table 4 Forecast accuracy and variance of ggp and benchmark models using diagonal (D) and full (F) Gaussian posteriors

8.1.3 Results

Results for solar models are presented in Table 4 with diagonal and full-Gaussian posterior specifications. ggp maintains or improves point accuracy when compared to the best performing benchmarks on both rmse and nlpd individually. For rmse, accuracy under ggp differs by less than 1% relative to gprn, and ggp similarly matches or improves on nlpd relative to lcm and the other benchmarks. ggp performs strongly in terms of overall accuracy across both measures, consistently achieving the highest average rank (m-rank). In contrast, competing baselines either perform well on rmse at the expense of poor performance under nlpd, or vice versa.

The benefit of regularization under the ggp is clear when considering mean forecast variance, which is lower under ggp than all benchmark models for all experiments. Compared to the un-grouped gprn (lcm), variance for solar forecasts is reduced by 18 to 24 (13 to 40)% under the most accurate ggp model.

We test statistical significance of the differences in performance discussed above via 95% intervals estimated by Monte Carlo. Results of the analysis show that rmse under gprn is statistically significantly lower than under ggp for the solar datasets. In all other cases, rmse is either not significantly different from or significantly higher than under ggp. Results are similar for nlpd, which is statistically significantly lower under lcm for two of three datasets, and otherwise higher or not significantly different.

With the exception of the mtg model, all multi-task models consistently improve on the naive independent forecast models. Figure 3 illustrates the benefit observed under the ggp (and other multi-task models) in reducing large forecast errors associated with variable weather conditions.

8.2 Wind speed forecasting

Wind variability shares characteristics with solar variability, as discussed in Sect. 1, with similar approaches applied to the problem of short term forecasting (Widén et al. 2015). We test our ggp method on forecasting ground wind speed 30 min ahead at six weather stations in Victoria, Australia, within an approximately \(30 \times 40\) km area. Data are half-hourly wind speed readings collected over an eight-month period. The wind data present an interesting challenge, with frequent missing and noisy observations (Fig. 2). After filtering, we have 4000 training points and 1024 test points per station.

We adopt the same kernel and feature definitions as for solar (Table 2); however, we use a different grouping structure for the ggp. We allow functions on the diagonal of \(\mathbf {W}\) to be independent and group off-diagonal functions within each row. For each task j, this structure allows the weight placed on its ‘own’ univariate node function \( \mathbf {g} _j\) to be independent of the weights placed over remaining sites, which are still spatially smoothed. We similarly adjust the number of inducing points, \(M\), to test models under equivalent settings, specifically setting \(QM^3 = RM^3 = 18 \times (200)^3\).

Fig. 2 Example data for normalized solar power (adel-summ, left hand side) and normalized wind speed (wind, right hand side). Wind data exhibit strong noise relative to summertime solar data

Fig. 3 Mean squared error under the ggp and igp approaches for adel-autm. Results shown for a single day with variable cloud cover causing high variability in power output

Results for wind are presented in Table 4 with diagonal and full-Gaussian posterior specifications. On this dataset, ggp outperforms all other models on all measures: point accuracy (rmse and nlpd), overall accuracy as measured by mean model ranking across both measures, and forecast variance. Consistent reductions in variance are observed for the wind dataset, ranging from 7 to 25% improvement over competing models. As for solar, confidence intervals are constructed via Monte Carlo; for wind, all differences in model performance are confirmed to be statistically significant.

Comparison to the approach of Remes et al. (2017) In addition to the above benchmarks, estimated using the generic sparse, variational inference framework, we also consider the approach of Remes et al. (2017). Since this method is a variational approach with \({\mathcal {O}}\left( (QN)^{3} \right) \) complexity that does not use inducing points, in order to fit a model under equivalent complexity conditions we take a subset of the training data such that \((QN)^{3}\) approximates the settings above. We estimate a model for the wind dataset, which has a manageable number of tasks, and set \(Q=2\). Equivalent complexity would imply \(N=66\) for wind; however, we set a minimum data size of \(N=200\).

We utilize the model implementation made available by the authors and allow all parameters to be optimized. The model gradient was optimized over 50 iterations, repeated ten times using different random parameter initializations. The model with the best performance (lowest objective function) was used to generate predictions.

The estimated value for rmse was 0.88 for the wind dataset, significantly higher than results under the ggp.

9 Timed experiments

To further examine the properties of the ggp model in relation to existing scalable multi-task methods, we conduct a series of timed experiments. We re-estimate models for the same forecasting problems as presented in Sect. 8 and, for each epoch completed, capture time and performance measures at that point. The goal of the analysis is to evaluate the time taken for the ggp approach to achieve gains in forecast accuracy relative to the independent gprn and other benchmarks, as well as the final forecast performance attained upon completion.

We reiterate that, as for the experiments presented in Sect. 8, the number of inducing points for each model is set to approximately standardize computational complexity per iteration.

All models are estimated on a multi-GPU machine with four NVIDIA TITAN Xp graphics cards (memory per card 12 GB; clock rate 1.58 GHz). Experiments were run until either convergence criteria were reached (see Sect. 8), or to a maximum of 500 epochs or 300 min runtime (these constraints were set conservatively based on previous experimental results). Starting values for common components were set to be equal across all models. Optimization settings were as for all other experiments.

9.1 Results of timed experiments

Representative rates of improvement in performance measures over time are presented in Fig. 4 for two datasets, wind and adel-summ. These datasets were selected since results for adel-autm and syd-autm were similar to those for wind. Results are shown for all multi-task benchmark approaches with full variational posterior specifications (similar results were obtained for the diagonal posterior setting). Performance metric values shown are recorded at the end of each epoch (hence the first value for each model is recorded at a different time, being the time taken to complete the first epoch) and are adjusted for the calculation time of performance capture.

Fig. 4 Forecast accuracy for wind (a) and adel-summ (b) datasets over optimization time in minutes for all multi-task benchmark models. rmse and nlpd are recorded after each epoch during optimization

For performance at a given point in time, results suggest a consistent ranking across the models tested. We observed that ggp achieves higher forecast accuracy significantly faster than gprn in the majority of cases, performing similarly to gprn in a few cases. Specifically, for all datasets except adel-summ, rmse reduces significantly faster under the ggp method relative to the gprn, and nlpd for the ggp surpasses that of the gprn relatively early in the optimization. The relative rates of improvement in rmse and nlpd shown for wind in Fig. 4 provide a typical example of the performance difference between the two models.

In terms of final accuracy over the four datasets considered, results confirm the general rankings of final model accuracy shown in Table 4. Specifically, gprn achieves the best accuracy in terms of rmse in two of four cases (adel-autm and adel-summ), while ggp improves on gprn in the other two (syd-autm and wind, with mtg performing best on syd-autm). Considering both speed and final accuracy together, ggp dominates the gprn, achieving lower rmse and nlpd in a significantly shorter time in the majority of cases. In some cases, gprn eventually overtakes ggp to achieve a slightly better result on rmse, but in no case does it achieve a better result on nlpd. The results of the timed experiments are therefore consistent with an improvement over the gprn in terms of speed of convergence, without loss of accuracy in terms of nlpd and with only minor loss of accuracy in terms of rmse.

With respect to the lcm and mtg models, these methods achieve improvements in forecast accuracy significantly more quickly than the ggp and gprn. Consistent with the results shown in Table 4, lcm achieves lower or similar nlpd to the ggp, with ggp outperforming lcm in two of four cases (syd-autm and wind). However, we note that, as shown in Fig. 4, the lcm converges relatively prematurely and never achieves ggp or gprn performance on rmse. A similar phenomenon was observed to a greater degree for the mtg, which converges quickly but achieves poor accuracy relative to other models on both rmse and nlpd, the exception being rmse for the syd-autm dataset.

Across the datasets considered, the ggp approach tends to achieve better forecast accuracy than the lcm where data are noisier, consistent with improved accuracy where the data require a more expressive model than the (fixed-weight) lcm approach. For example, the adel-summ dataset has significantly less noise than the autumn and wind datasets (Fig. 2). Consequently, gprn, lcm and ggp all perform similarly on the adel-summ dataset in terms of final accuracy, but lcm is significantly faster, suggesting there is little advantage to a more costly, expressive model such as gprn or ggp. In contrast, for noisier datasets, ggp and gprn continue to improve over lcm, and ggp does so at a faster rate than the gprn. Figure 4 illustrates the typical relative accuracy of multi-task models over time.

10 Discussion

We have proposed a general multi-task gp model, where groups of functions are coupled a priori. Our approach allows for input-varying covariance across tasks governed by kernels and features and, by building upon sparse variational methods and exploiting Kronecker structures, our inference method is inherently scalable to a large number of observations.

We have shown the applicability of our approach to forecasting short term distributed solar power and wind speed at multiple locations, where it matches or improves point forecast performance of single-task learning approaches and other multi-task baselines under similar computational constraints while improving quantification of predictive variance. We have also demonstrated that our approach can yield important reductions in time taken to achieve the same accuracy relative to the equivalent model without coupled priors. In general, the ggp strikes a balance between flexible, task-specific parameterization and effective regularization via structure imposed in the prior.

While we focus on a priori spatial dependencies, we emphasize that other grouping structures and kernels, likelihood functions or applications are possible. For example, non-spatial covariates in other domains, or grouping of functions according to clusters of tasks, could be adopted.