1 Introduction

Solar power forecasting is motivated by several areas of application, including grid management, load shifting (demand management) and energy storage management. As small-scale residential solar penetration grows, the challenges of forecasting power for many distributed small-scale sites, in particular forecasting with incomplete site information and noisy power data, become of interest.

Challenges in this context include nonstationarity in the data and the development of useful probabilistic forecasts. Many forecasting methods assume it is possible to detrend power data prior to stochastic modelling in order to ‘flatten’ the data and remove diurnal cyclical trends associated with cycles in solar radiation. However, methods to do so rely on comprehensive site information [7], or site history as in [5, 17]. Overall, existing methods have high data demands, constraining their usefulness for new or unseen sites.

For certain applications it is desirable to work with a probabilistic distribution of forecasts that quantifies forecast uncertainty. Statistically based methods, such as vector autoregressive (VAR) models, typically allow probabilistic forecasts, but are constrained in their application to unflattened data and to sites for which no training data are available. Machine learning methods such as artificial neural networks (ANNs) are more widely applicable but do not generally produce probabilistic forecasts. Gaussian process models are advantageous in this regard, providing a flexible, nonparametric forecasting approach that is also probabilistic in nature.

Transfer learning over distributed sites may assist in addressing site data limitations as well as improve prediction of weather-related power fluctuations. The literature suggests cross-site data can be helpful in modelling cloud conditions to improve site-level forecasts, as in [3, 12, 23], with evidence that cross-site information in a dense network can be relevant from timescales of a few minutes, as in [27], to multi-hour horizons in a widely distributed network, as in [3].

A key constraint often associated with multisite approaches is scalability to large numbers of sites. Within the Gaussian process literature, several approximate methods have been developed that support stochastic parameter optimisation, thus maintaining scalability to large datasets and feasibility for real world application.

The current study considers the problem of short term (less than 30 min) power forecasting for large distributed networks of residential rooftop solar systems. We apply sparse variational Gaussian process (gp) approaches for probabilistic forecasting across multiple solar sites in Adelaide, Australia. Our aim is to test whether scalable gp methods can be applied to short term distributed forecasting to provide useful, probabilistic forecasts at the site level with limited site history and information.

1.1 Related Work

The literature around solar forecasting is extensive, including studies that investigate both solar irradiance and power forecasting over multiple forecast horizons (a few minutes to multiple days) using approaches that range from physics-based models to statistical and machine learning methods. Studies to date also examine multiple inputs including irradiance or power measurements, ground and satellite based weather data and meteorological forecasts. Several reviews [10, 14, 18, 25] provide a thorough coverage of recent methods.

In the sphere of short term forecasting, stochastic models utilising only historical power data have been shown to perform relatively well in the past [18], although recent advances suggest highly accurate forecasts can be produced by including comprehensive climate data [16]. Predominant statistical methods include adaptively estimated VAR, autoregressive integrated moving average (ARIMA) and generalised autoregressive conditional heteroskedasticity (GARCH) models, for example as in [3, 9]. These methods allow for weather-related nonstationarity and at shorter horizons (up to one hour) have been found to be competitive in forecasting clearness indices (i.e. flattened irradiance data) [9, 15]. In a number of cases models are applied in a multisite setting as in [3, 5, 11, 27].

The major machine learning methods explored for short term solar forecasting are neural networks and support vector machines (SVMs). Recent examples of ANNs for short term horizons include [12, 20]. ANNs have been explored in a multisite setting in irradiance forecasting [25], although at time of writing no examples were identified of multivariate prediction at horizons less than one hour ahead.

Gaussian process and related models have been explored to a limited extent in solar forecasting. [9, 15] include univariate gp models applied to clearness indices as comparative models. [4] also uses a gp model to forecast clearness index values over an irradiance field in 30 min increments. The authors apply probabilistic principal components dimension reduction to improve feasibility of real time adaptive gp modelling over multiple locations (by reducing the number of ‘locations’ for which a gp is estimated), and further assume independence between models, thus respecifying the multivariate problem as several univariate problems. However, even with these adaptations, scalability is still constrained by non-stochastic optimisation of the exact gp as described in Sect. 2.

Several studies use a closely related method, kriging, to predict clearness indices in a multisite setting [2, 22, 23, 26]. Building on [22], the study in [26] develops one-hour-ahead clearness index forecasts using one month of hourly data from a group of 10 meteorological stations in Singapore. In [23], the authors ‘nowcast’ clearness index values for 25 sensor locations covering an approximately 30 km radius area in Osaka.

2 Theory

Gaussian process (gp) models provide a flexible nonparametric Bayesian approach to machine learning problems such as regression and classification [21] and have proved successful in various application areas involving spatio-temporal modeling [8]. Formally, a gp is a prior over functions for which every subset of function values \(f( \mathbf {x} _1), \ldots , f( \mathbf {x} _n)\) follows a Gaussian distribution. We denote a function drawn from a gp with mean function \(\mu ( \mathbf {x} )\) and covariance function \(\kappa (\cdot , \cdot )\) by \(f( \mathbf {x} ) \sim \mathcal {GP}(\mu ( \mathbf {x} ), \kappa ( \mathbf {x} , \mathbf {x} ^\prime ) )\).

One of the most widely used gp models is the standard regression setting with a zero-mean gp and i.i.d. Gaussian noise:

$$\begin{aligned} y_{t} \sim \mathcal {N}(f ( \mathbf {x} _t), \sigma ^2_y) \text { with } f( \mathbf {x} _t) \sim \mathcal {GP}( \mathbf {0} , \kappa ( \mathbf {x} _t, \mathbf {x} _{t^{\prime }} )) \text {,} \end{aligned}$$
(1)

where \( \mathbf {x} _t\) denotes features at time t and \(\sigma ^2_y\) is the noise variance.

Given a set of observations \(\{ ( \mathbf {x} _t, y_t) \}_{t=1}^{N}\), we wish to learn a model in order to make predictions at a new datapoint \( \mathbf {x} _*\). Given the likelihood and prior models in Eq. (1), the predictive distribution over \( f( \mathbf {x} _*)\) is a Gaussian with mean and variance given by:

$$\begin{aligned} \mu _{*} = \kappa ( \mathbf {x} _*, \mathbf {X}) ( \mathbf {K} + \sigma ^2_y\mathbf {I})^{-1} \mathbf {y} \text {,} \quad \sigma ^2_{*} = \kappa ( \mathbf {x} _*, \mathbf {x} _*) - \kappa ( \mathbf {x} _*, \mathbf {X}) ( \mathbf {K} + \sigma ^2_y\mathbf {I})^{-1} \kappa (\mathbf {X}, \mathbf {x} _*) \text {,} \end{aligned}$$

where \(\mathbf {X}\) and \( \mathbf {y} \) denote all the training features and outputs, respectively; \(\mathbf {K}\) is the covariance matrix induced by evaluating the covariance function at all training datapoints; and \(\mathbf {I}\) is the identity matrix.

Although computing the exact predictive distribution above is appealing from a theoretical perspective and in the small-data regime, these computations become infeasible for large datasets, as their time and space complexity are \(\mathcal {O}(N^3)\) and \(\mathcal {O}(N^2)\), respectively.
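The predictive equations above can be sketched in a few lines of numpy. This is a toy illustration of the exact gp, not the implementation used in the study; the RBF kernel, hyperparameter values and synthetic data are arbitrary choices for demonstration, and the \(\mathcal {O}(N^3)\) cost is visible in the dense linear solve against the \(N \times N\) matrix \(\mathbf {K} + \sigma ^2_y\mathbf {I}\).

```python
import numpy as np

def rbf(A, B, lengthscale=1.0, variance=1.0):
    """Squared-exponential covariance between rows of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def gp_predict(X, y, x_star, noise_var=0.01):
    """Exact GP predictive mean and variance at x_star (O(N^3) solve)."""
    K = rbf(X, X) + noise_var * np.eye(len(X))
    k_star = rbf(x_star, X)
    mu = k_star @ np.linalg.solve(K, y)
    v = np.linalg.solve(K, k_star.T)
    var = rbf(x_star, x_star) - k_star @ v
    return mu, np.diag(var)

# Noisy observations of a smooth latent function
rng = np.random.default_rng(0)
X = np.linspace(0, 5, 40).reshape(-1, 1)
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(40)
mu, var = gp_predict(X, y, np.array([[2.5]]))
```

The predictive mean recovers the latent function near observed inputs, while the variance quantifies the remaining uncertainty over \(f( \mathbf {x} _*)\).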

Much of the research effort in gp models has been devoted to this issue [19], with significant breakthroughs achieved over the last few years [13, 24]. Indeed, here we study the variational approach to inference in gp models, which relies upon reformulating the prior via the so-called inducing variables [24].

2.1 Scalable Gaussian Process Regression via Variational Inference

Full details of the variational approach to scalable gp regression are beyond the scope of this paper; we refer the reader to [6, 13, 24]. Here it suffices to explain that we introduce a set of M inducing variables \( \mathbf {u} = (u_1, \ldots , u_M)\), which lie in the same space as the original function values and are drawn from the same gp prior. For these inducing variables we have their corresponding inputs \(\mathbf {Z}= ( \mathbf {z} _1, \ldots , \mathbf {z} _M)\), where each \( \mathbf {z} _j\) is a D-dimensional vector in the same space as the original features \( \mathbf {x} \).

The variational approach to gp inference involves a reformulation of the prior via the inducing variables and the proposal of an approximate posterior over these using \( q( \mathbf {u} ) = \mathcal {N}( \mathbf {m} , \mathbf {S}) \text {,} \) which is estimated via the optimization of the so-called evidence lower bound (elbo):

$$\begin{aligned} \mathcal {L}_{\text {elbo}}( \mathbf {m} , \mathbf {S}) = \mathbb {E}_{q( \mathbf {f} )}{\left[ \log p( \mathbf {y} | \mathbf {f} )\right] } - \mathrm {KL}(q( \mathbf {u} ) ||p( \mathbf {u} )) \text {,} \end{aligned}$$
(2)

where \(\mathrm {KL}(q ||p)\) denotes the Kullback-Leibler divergence between distributions q and p; \(p( \mathbf {u} ) = \mathcal {N}( \mathbf {0} , \kappa (\mathbf {Z}, \mathbf {Z}))\) is the Gaussian prior over the inducing variables; \( \mathbb {E}_{q( \mathbf {f} )}{\left[ \log p( \mathbf {y} | \mathbf {f} )\right] } \) is the expectation of the conditional likelihood (given in Eq. (1)) over \( q( \mathbf {f} ) = \int _{ \mathbf {u} } q( \mathbf {u} ) q( \mathbf {f} | \mathbf {u} ) d \mathbf {u} \text {;} \) and \(q( \mathbf {u} )\) is the approximate posterior given above. Using simple properties of the Gaussian distribution it is possible to show that the expectation in Eq. (2) can be computed analytically and, more importantly, that \(\mathcal {L}_{\text {elbo}}\) decomposes as a sum of objectives over the training data. This readily allows the application of stochastic optimization methods, rendering the per-iteration time and space complexity of the algorithm \(\mathcal {O}(M^3)\) and \(\mathcal {O}(M^2)\), respectively, hence independent of N and applicable to very large datasets.
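The decomposition of the bound can be made concrete with a minimal numpy sketch. This illustrates the standard construction rather than the study's own code: for a fixed Gaussian \(q( \mathbf {u} ) = \mathcal {N}( \mathbf {m} , \mathbf {S})\), the expected log-likelihood splits into one term per datapoint, while the KL term is a single quantity independent of N; the kernel and toy data are arbitrary assumptions.

```python
import numpy as np

def rbf(A, B, l=1.0, v=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return v * np.exp(-0.5 * d2 / l**2)

def elbo_terms(X, y, Z, m, S, noise_var):
    """Per-datapoint expected log-likelihood terms plus the single KL term.
    Marginalising u out of p(f_t | u) q(u) gives an independent Gaussian
    q(f_t) for each t, so the expected log-likelihood is a sum over t."""
    Kzz = rbf(Z, Z) + 1e-8 * np.eye(len(Z))
    Kxz = rbf(X, Z)
    A = np.linalg.solve(Kzz, Kxz.T).T               # K_xz K_zz^{-1}
    mu = A @ m                                      # means of q(f)
    var = rbf(X, X).diagonal() - (A * Kxz).sum(1) + ((A @ S) * A).sum(1)
    # Closed form for E_{q(f_t)}[log N(y_t | f_t, noise_var)]
    ell = -0.5 * np.log(2 * np.pi * noise_var) \
          - 0.5 * ((y - mu) ** 2 + var) / noise_var
    # KL(N(m, S) || N(0, Kzz)) -- a single term, independent of N
    kl = 0.5 * (np.trace(np.linalg.solve(Kzz, S))
                + m @ np.linalg.solve(Kzz, m) - len(Z)
                + np.linalg.slogdet(Kzz)[1] - np.linalg.slogdet(S)[1])
    return ell, kl

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, (50, 1)); y = np.sin(X[:, 0])
Z = np.linspace(0, 5, 8).reshape(-1, 1)             # M = 8 inducing inputs
m0 = np.zeros(8); S0 = rbf(Z, Z) + 1e-8 * np.eye(8) # q(u) set to the prior
ell, kl = elbo_terms(X, y, Z, m0, S0, noise_var=0.1)
elbo = ell.sum() - kl                               # full-data ELBO
```

Because `ell` is a sum over datapoints, an unbiased minibatch estimate of the bound is `(N / B) * ell[batch].sum() - kl`, which is what enables stochastic optimisation.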

2.2 Gaussian Processes for Solar Power Forecasting

A key advantage of gp models is their flexibility to express potentially nonlinear relationships and nonstationary processes through various kernel forms. gp models have the capacity to account for nonstationarity associated with diurnal cycles through appropriate kernel functions. Further, their nonparametric nature allows models to flexibly reflect variable volatility, i.e. nonstationarity associated with weather effects.

In the present study, we propose several Gaussian process model specifications for application to the residential solar forecasting problem where site information is unknown. Kernels are structured to capture both cyclical and autoregressive processes in the power data. We compare results under both ‘site-independent’ approaches, where Gaussian process models are applied to sites individually, and multi-site approaches, where forecasting for multiple sites is performed via collaborative Gaussian process models.

Site-Independent Models. Consider the time series of power observations for a single site p, denoted \(y_{p}\), at times \(t=0,\ldots ,N\). As in (1), let

$$\begin{aligned} y_{pt}=f_{p}( \mathbf {x} _{pt})+\epsilon _{pt},\quad f_{p}( \mathbf {x} ) \sim \mathcal {GP}(0,\kappa _{p}( \mathbf {x} _{pt}, \mathbf {x} _{ps})) \end{aligned}$$
(3)

where \( \epsilon _{pt}\) is i.i.d. \(\mathcal {N}(0,\sigma ^{2}_{y_{p}})\) noise. Under the gp specification, observed power \(y_{pt}\) is a function of a latent Gaussian process, \(f_{p}( \mathbf {x} _{pt})\), plus idiosyncratic noise \(\epsilon _{pt}\). The covariance between power at time t and time \(s, s \ne t\), is thus given by the kernel function \(\kappa _{p}( \mathbf {x} _{pt}, \mathbf {x} _{ps})\). The likelihood function is given by \(y_{pt}|f_{pt}\sim \mathcal {N}(f_{pt},\sigma ^{2}_{y_{p}})\).

In the site-independent setting, the feature vector \( \mathbf {x} _{pt}\) comprises two main elements: a time index t and a set of lagged power observations at pre-specified five minute intervals, denoted \( \mathbf {g} \). In order to forecast power at \(t+\delta \), for \(\delta \) steps ahead, the lag features are current observed power and past observed power at 5 and 10 min prior. Thus \( \mathbf {x} _{pt}=(t, \mathbf {g} _{pt})\), with \( \mathbf {g} _{pt}=(y_{pt-\delta },y_{pt-\delta -1},y_{pt-\delta -2})\). Lags were selected in line with previous studies that find immediate lags are relevant for short term forecasting (see e.g. [26]).

Additional ‘extended site-independent’ models are estimated using an augmented set of lag features. The feature vector is extended to include power observations of nearby sites, that is \( \mathbf {g} _{pt}=(y_{pt-\delta },y_{pt-\delta -1},y_{pt-\delta -2},y_{-p t-\delta },y_{-p t-\delta -1},\) \(y_{-p t-\delta -2})\), where \(y_{-p}\) denotes all sites near to site p. Utilising cross-site features in the form of lags allows separate site model estimation and has been applied in several studies including [3, 12]. We define ‘near’ as being within a 10 km radius.
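The construction of the lag features can be sketched as follows. This is a hypothetical helper for illustration (the study does not publish its feature-building code); it builds the site-independent feature rows \((t, y_{t}, y_{t-1}, y_{t-2})\) aligned with targets \(\delta \) steps ahead, taking one index step to equal one five-minute interval.

```python
import numpy as np

def lag_features(y, delta):
    """Build (t, g) feature rows for forecasting y[t + delta] from the
    three most recent 5-min observations y[t], y[t-1], y[t-2].

    Returns X with columns (time index, y_t, y_{t-1}, y_{t-2}) and the
    aligned targets y_{t+delta}."""
    t = np.arange(2, len(y) - delta)            # valid forecast origins
    g = np.stack([y[t], y[t - 1], y[t - 2]], axis=1)
    X = np.column_stack([t, g])
    targets = y[t + delta]
    return X, targets

y = np.arange(10.0)                             # toy power series
X, tgt = lag_features(y, delta=2)
```

For the extended models, the same construction would simply be repeated over each neighbouring site's series and the columns concatenated.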

Kernel functions for site-independent and extended site-independent models comprise several separable kernel elements. A periodic kernel is applied to the time index to capture daily cyclical trends in output and is defined as

$$\begin{aligned} \kappa _{Per.}(t,s)&=\theta \exp \left[ -0.5\left( \frac{\sin \left( \frac{\pi }{T}\left( t-s\right) \right) }{l}\right) ^{2}\right] \end{aligned}$$
(4)

where \(\theta \) governs cycle amplitude, T denotes cycle period (fixed at one day), and lengthscale, l, governs rate of decay in covariance as the time-span between observations increases.

A linear kernel is applied to lag features \(g_{i} \in g\) to capture short term variations from the regular diurnal trend:

$$\begin{aligned} \kappa _{Lin.}( \mathbf {g} _{pt}, \mathbf {g} _{ps} )&=\sum _{i}\sigma _{i}g_{pti}g_{psi} \end{aligned}$$
(5)

where \(\sigma _{i}\) are in effect weight coefficients. The overall kernel structure for all site-independent models is:

$$\begin{aligned} \kappa _{p}( \mathbf {x} _{pt}, \mathbf {x} _{ps})=\kappa _{Per.}(t,s)\kappa _{Lin.}( \mathbf {g} _{pt}, \mathbf {g} _{ps}). \end{aligned}$$
(6)
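Equations (4)–(6) can be realised in a short numpy sketch. This is an illustrative implementation under stated assumptions, not the study's code: the hyperparameter values are arbitrary, and the period T is taken as 144 index steps, matching the 144 five-minute observations per 7 am–7 pm day described later in the paper.

```python
import numpy as np

def k_periodic(t, s, theta=1.0, T=144.0, l=1.0):
    """Periodic kernel of Eq. (4); T = 144 five-minute steps per day."""
    return theta * np.exp(
        -0.5 * (np.sin(np.pi * (t[:, None] - s[None, :]) / T) / l) ** 2)

def k_linear(G1, G2, sigma):
    """Linear kernel of Eq. (5): weighted inner product of lag features."""
    return (G1 * sigma) @ G2.T

def k_site(t, s, G1, G2, sigma, theta=1.0, T=144.0, l=1.0):
    """Separable site kernel of Eq. (6): periodic (time) x linear (lags)."""
    return k_periodic(t, s, theta, T, l) * k_linear(G1, G2, sigma)

t = np.arange(4.0)                       # toy time indices
G = np.ones((4, 3))                      # toy lag-feature rows
K = k_site(t, t, G, G, sigma=np.array([0.5, 0.3, 0.2]))
```

The elementwise product of two valid kernel matrices is itself a valid (positive semi-definite) kernel matrix, which is what licenses the separable structure of Eq. (6).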

Multi-site Models. Values for proximate sites would be expected to covary, due to both synchronous diurnal cycles in unflattened data and shared weather systems. Some efficiency would thus be expected from exploiting the shared covariance structure through collaborative learning.

Two separate multi-site gp model structures are estimated for site-level power forecasting. The first is a pooled structure, where (standardised) site data are used in a joint specification with shared kernel parameter values. The second structure is the linear coregionalisation model or LCM. This structure assumes site observations covary through a lower dimension set of shared latent processes.

For each multi-site model structure, two alternative kernel specifications are explored. These four model specifications are detailed below.

Pooled Model. The pooled, or ‘joint’, model is a Gaussian process model in which all site observations share a common kernel that includes an additional kernel element defining a spatial covariance factor.

The first pooled model kernel (‘Joint Model 1’) is defined as a multiplicative, separable spatiotemporal kernel added to a shared linear kernel applied to lagged power values. Feature vector \( \mathbf {x} \) is extended to include \( \mathbf {h} =(latitude,longitude)\) i.e. \( \mathbf {x} _{pt}=(t, \mathbf {g} _{pt}, \mathbf {h} _{p})\). A radial basis function (RBF) kernel is applied to \(h_{i} \in \mathbf {h} \) to capture spatial dependencies, thus for sites p and q,

$$\begin{aligned} \kappa ( \mathbf {x} _{pt}, \mathbf {x} _{qs})=\kappa _{Per.}(t,s)\kappa _{RBF}( \mathbf {h} _{p}, \mathbf {h} _{q})+\kappa _{Lin.}( \mathbf {g} _{pt}, \mathbf {g} _{qs}). \end{aligned}$$
(7)

where

$$\begin{aligned} \kappa _{RBF}( \mathbf {h} _{p}, \mathbf {h} _{q})=\sigma ^{2}\exp \big \{-\frac{1}{2}\sum _{i=1}^{2}\left( (h_{pi}-h_{qi})/l_{i}\right) ^{2}\big \}. \end{aligned}$$
(8)

In the RBF kernel, \(\sigma ^{2}\) governs maximum covariance between points \(h_{p}\) and \(h_{q}\), and lengthscale, \(l_{i}\), governs the rate of decay in covariance as distance between observations along the relevant axis increases.
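The spatial RBF kernel of Eq. (8) is straightforward to sketch. The coordinates and lengthscales below are hypothetical values chosen for illustration only (the study's fitted hyperparameters are not reported here); per-axis lengthscales give the ARD form of the equation.

```python
import numpy as np

def k_rbf(H1, H2, sigma2=1.0, l=(1.0, 1.0)):
    """ARD RBF kernel of Eq. (8) over (latitude, longitude) features."""
    d2 = (((H1[:, None, :] - H2[None, :, :]) / np.asarray(l)) ** 2).sum(-1)
    return sigma2 * np.exp(-0.5 * d2)

# Hypothetical site coordinates (degrees); lengthscales also in degrees
H = np.array([[-34.92, 138.60],
              [-34.95, 138.58],
              [-34.80, 138.70]])
K = k_rbf(H, H, sigma2=2.0, l=(0.05, 0.05))
```

As expected, covariance attains its maximum \(\sigma ^{2}\) on the diagonal and decays with distance, so the two nearby sites covary more strongly than the distant pair.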

The second joint model (‘Joint Model 2’) is specified similarly but replaces the shared linear kernel with separately parameterised linear kernels for each site. Specifically, \(\kappa _{Lin.}( \mathbf {g} _{pt}, \mathbf {g} _{qs})\) becomes

$$\begin{aligned} \kappa _{Lin.,p}( \mathbf {g} _{pt}, \mathbf {g} _{qs})=\sum _{i}\sigma _{pi}g_{pti}g_{qsi},\quad \kappa _{Lin.,p}=0\quad \text {for}\quad p\ne q. \end{aligned}$$
(9)

Coregional Model. The linear coregional model (LCM) assumes \(y_{p}\) is a function not of a single latent process \(f_{p}( \mathbf {x_{p}} )\) but of a linear combination of several independent latent Gaussian processes. Covariance between sites arises from these shared latent processes. The weights defining the linear combination for a given site are site-specific: \( f_{p}( \mathbf {x} )=\sum _{j=1}^{Q} w_{pj}u_{j}( \mathbf {x} ) \text {.} \)

We assume three latent processes \(u_{j}( \mathbf {x} ), j=1,\ldots ,3\), in the first LCM model (‘LCM Model 1’) and two latent processes in the second model (‘LCM Model 2’). Each latent process has an associated kernel, \(\kappa _{j}\), giving rise to a shared covariance structure across sites driven by both kernel elements and weight matrices.

Let \(\mathbf {B}_{j}= \mathbf {W}_{j} \mathbf {W}_{j}'+ \mathbf {D}_{j}\), where \(\mathbf {W}_{j}\) is a \(P\times 1\) matrix of weights \(w_{pj}\) (with P the number of sites) and \(\mathbf {D}_{j}\) is a diagonal matrix of isotropic noise. We define \(\kappa _{1}=\kappa _{Per.}(t,s)\) and \(\kappa _{2}=\kappa _{RBF}(t,s)\) respectively as periodic and RBF kernels applied to time indices. The third latent process kernel is defined as \(\kappa _{3}=\kappa _{Lin.}( \mathbf {g} _{t}, \mathbf {g} _{s})\). The shared kernel structure in LCM Model 1 is thus given by:

$$\begin{aligned} \mathbf {K} (f_{p} ( \mathbf {x} _{pt}),f_{q}( \mathbf {x} _{qs} ) )&=\sum _{j=1}^{3}\left[ \mathbf {B}_{j}\right] _{pq}\kappa _{j}( \mathbf {x} _{pt}, \mathbf {x} _{qs}). \end{aligned}$$
(10)
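Assembling the LCM covariance of Eq. (10) amounts to weighting each latent kernel matrix by the corresponding entry of \(\mathbf {B}_{j}\). The sketch below is a toy construction with arbitrary weights and identity-like latent kernels, intended only to show the structure, not the fitted model.

```python
import numpy as np

def lcm_cov(B_list, K_list, p, q):
    """Cross-covariance block of Eq. (10) between sites p and q:
    sum_j B_j[p, q] * K_j, with K_j the j-th latent kernel matrix."""
    return sum(B[p, q] * K for B, K in zip(B_list, K_list))

rng = np.random.default_rng(1)
n_sites, n_times, Q = 2, 3, 2
W = rng.standard_normal((Q, n_sites, 1))            # per-process weights
B_list = [W[j] @ W[j].T + 0.01 * np.eye(n_sites) for j in range(Q)]
K_list = [np.eye(n_times), 0.5 * np.eye(n_times)]   # toy latent kernels
K_01 = lcm_cov(B_list, K_list, 0, 1)                # cross-site block
K_00 = lcm_cov(B_list, K_list, 0, 0)                # within-site block
```

Because each \(\mathbf {B}_{j}\) is positive semi-definite, the full covariance over all site/time pairs is a sum of Kronecker-structured PSD terms and hence a valid kernel.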

The second coregional model is similar, except that the linear kernel component is again treated differently. In LCM Model 2, \(Q=2\) with \(\kappa _{1}\) and \(\kappa _{2}\) defined as above, and lag features are included in a separate kernel component defined as in (9). Thus

$$\begin{aligned} \mathbf {K} (f_{p}( \mathbf {x} _{pt}),f_{q}( \mathbf {x} _{qs}))&=\sum _{j=1}^{2}\left[ \mathbf {B}_{j}\right] _{pq}\kappa _{j}( \mathbf {x} _{pt}, \mathbf {x} _{qs})+\kappa _{Lin.,p}( \mathbf {g} _{pt}, \mathbf {g} _{qs}). \end{aligned}$$
(11)

The specification in (11) allows a slightly more expressive parameterisation of the linear kernel than (10).

Benchmark Models. Without clear sky normalisation and under the assumption of short site history, there are few existing models for comparison. One feasible benchmark prevalent in the literature is the persistence model, which forecasts the next observation as the current observation, i.e. \(y_{t+1}=y_{t}\). Persistence models are estimated for each site separately.

In addition, the site-independent Gaussian process models serve as a benchmark. These are approximately equivalent to Bayesian linear regression models with a standard Gaussian prior distribution over the regression coefficients. As such they are closely related to VAR models as applied to flattened data.

3 Experiments

The analysis makes use of a sample of 37 residential photovoltaic systems installed within an approximately 10 by 15 km ‘box’ in the central Adelaide area. Most sites have an installed capacity of 2 to 5 kW. The dataset comprises 5-minute average power readings over a 30 day period in January 2017 (specifically, the 30 days ending 28 January 2017). Days were defined as 7 am to 7 pm, yielding a total of 144 observations per site per day. This amounts to a total of 159,840 observations, which is clearly infeasible for standard (non-scalable) gp models.

The goal of the experiments is to test whether gp models estimated under a sparse variational framework can be applied to forecast distributed power output at the site level for multiple distributed sites. In particular, whether (a) combined kernel forms can be used to model nonstationary data characteristics, and (b) collaborative learning can improve forecast accuracy or reduce data requirements compared to independent site forecasts.

The four multisite models set out above are used to forecast output for each site. These are compared to results under the site-independent and persistence models. Models are trained for forecasting horizons from five to thirty minutes at five minute intervals. The forecast target in each case is five minute average power at that horizon. Models were trained using the first 60% of observations (18 days). Forecasts were then generated for a test set of the following 40% of observations (12 days) for each site.

All models are estimated via the sparse, variational approach described in Sect. 2. Inducing points are initialised at cluster centroids and optimised within the model. To illustrate the scalability of the approach, we use 2300 inducing points for joint models, or approximately 2.4% of the training set size. Maintaining the same ratio, 63 inducing points per site were used for individual models.

3.1 Accuracy Metrics

Forecast accuracy is assessed for each site for each model using three measures: mean absolute error (MAE) in kilowatts, standardised mean squared error (SMSE) and standardised mean log loss (SMLL), as defined in [21]. Specifically,

$$\begin{aligned}&SMSE=\frac{1}{N_{te}}\sum _{i=1}^{N_{te}}\left( \frac{y_{i}-\hat{y}_{i}}{\sigma _{y_{te}}}\right) ^{2}\end{aligned}$$
(12)
$$\begin{aligned}&MAE=\frac{1}{N_{te}}\sum _{i=1}^{N_{te}}|y_{i}-\hat{y}_{i}|\end{aligned}$$
(13)
$$\begin{aligned}&SMLL=\frac{1}{N_{te}}\sum _{i=1}^{N_{te}}(nlpd_{i}-nll_{i}), \quad \text {where}\\&nlpd_{i}=\frac{1}{2}\left[ ln(2\pi )+ln\sigma ^{2}_{\hat{y_{i}}}+\left( \frac{y_{i}-\hat{y_{i}}}{\sigma _{\hat{y_{i}}}}\right) ^{2}\right] , \nonumber \\&nll_{i}=\frac{1}{2}\left[ ln(2\pi )+ln\sigma ^{2}_{y_{tr}}+\left( \frac{y_{i}-\mu _{y_{tr}}}{\sigma _{y_{tr}}}\right) ^{2}\right] . \nonumber \end{aligned}$$
(14)

Subscripts te and tr refer to the test and training sets respectively, and \(\hat{y}\) denotes the predicted value of y. SMSE is standardised by reference to the test set variance \(\sigma _{y_{te}}^{2}\); values less than one indicate the model improves on a simple mean forecast. SMLL measures the (negative log) likelihood of the test data under the model, denoted nlpd, relative to the (negative log) likelihood under the trivial normal distribution with parameters (\(\overline{y}_{tr}\), \(\sigma ^{2}_{y_{tr}}\)), denoted nll. More negative SMLL values indicate better relative performance of the model.
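The three metrics of Eqs. (12)–(14) can be computed directly from the predictive means and variances. The sketch below is a straightforward transcription of the definitions (the toy inputs are arbitrary): a perfect mean forecast gives zero MAE and SMSE, and confident, accurate predictive distributions drive SMLL negative.

```python
import numpy as np

def forecast_metrics(y_te, mu, var, y_tr):
    """MAE, SMSE and SMLL of Eqs. (12)-(14); mu and var are the
    predictive mean and variance at each test point."""
    mae = np.mean(np.abs(y_te - mu))
    smse = np.mean(((y_te - mu) / y_te.std()) ** 2)
    # Negative log predictive density under the model, per test point
    nlpd = 0.5 * (np.log(2 * np.pi * var) + (y_te - mu) ** 2 / var)
    # Negative log likelihood under the trivial N(mean_tr, var_tr) model
    nll = 0.5 * (np.log(2 * np.pi * y_tr.var())
                 + (y_te - y_tr.mean()) ** 2 / y_tr.var())
    smll = np.mean(nlpd - nll)
    return mae, smse, smll

y_tr = np.array([0.0, 1.0, 2.0, 3.0])
y_te = np.array([1.0, 2.0])
mae, smse, smll = forecast_metrics(y_te, y_te.copy(), np.full(2, 0.01), y_tr)
```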

3.2 Results

Forecast Accuracy. Results at the site level suggest the site-independent model performs as well as or better than the joint (pooled) model in terms of average site accuracy (Fig. 1). SMSE for both the site-independent and joint models ranges from 0.05–0.12 over 5–30 min forecast horizons; however, MAE and SMLL are consistently better under the site-independent model over all forecast horizons, e.g. MAE of 0.14–0.26 kW versus 0.17–0.29 kW under the site-independent and joint models respectively. The LCM specifications perform poorly on all measures relative to the joint and basic site-independent models. At all forecast horizons, the better performing models (joint and site-independent) are more accurate than the persistence benchmark.

Additional expressiveness in the kernel due to the more flexible linear lag kernel structure does not significantly improve forecast accuracy in the joint or LCM models, and in some cases tends to contribute to higher forecast variability across sites (Fig. 1). Interestingly, the extended site-independent model performs very poorly relative to other models, although forecast error remains fairly stable over 10–30 min horizons. This result may indicate overspecification of this (very flexible) kernel structure.

Fig. 1. Site forecast mean error and error variability

Estimation of Daily Power Curve. It is difficult to evaluate the current approach as an alternative to those that require flattening the data without a direct (flattened) benchmark for the given dataset. However, examining forecast accuracy on clear days provides some insight into how the approach accounts for clear sky curves. Table 1 summarises forecast accuracy under the site-independent and Joint Model 1 specifications for clear (or mostly clear) and cloudy days in the test set, which each represent 50% of the test data.

Table 1. Mean site forecast accuracy on clear versus cloudy days

Forecast accuracy appears competitive on clear days, with mean MAE across sites of 50 Watts on clear days at the five minute horizon, rising to 130 Watts at the 30 min horizon. Given mean power for the full dataset across clear and cloudy days of 2.1 kW, MAE represents around 2.4% (6.2%) of mean power at the 5 (30) min horizon. Similarly, SMSE of 0.002 to 0.009 (for site-independent results) over 5 to 30 min horizons implies that average mean squared error is less than one percent of total power variation on clear days.

Considering transfer learning more generally, it is relevant to note the better performance of the joint model on cloudy days, which contrasts with the better performance of the site-independent models on clear days (Table 1). On all measures, the best performing joint model performs consistently better during variable weather, while the opposite is true for sunny weather periods (noting accuracy is significantly diminished for both models on cloudy days). This suggests ‘negative’ transfer effects with respect to forecasting diurnal cycles, while forecast errors are somewhat moderated during cloudy periods by the joint model.

4 Discussion

The scalable, approximate Gaussian process methods appear to have significant potential in the distributed forecasting setting. We are able to produce probabilistic site level forecasts using a flexible, nonparametric method in a large scale setting. Further, the approach seems to incorporate diurnal cycles within an integrated model successfully without exogenous site information.

Gaussian process based models produce a strong level of accuracy on sunny days for forecasts out to the 30 min horizon. Overall, however, accuracy of models during cloudy conditions appears low. Given the absence of feature data beyond location, time and output, it is possible that accuracy can be substantially improved (as in [16]) via inclusion of weather or other external data, including site features where available.

Overall accuracy of forecasts is not improved by jointly estimated models (pooled and coregional) compared to site-independent models. Performance in cloudy versus clear weather, however, illustrates that there may be potential for transfer learning benefits during more variable weather.

One possible factor affecting model performance is the spatial covariance kernel, which is a stationary function resulting in sites equally distant along a fixed axis being assigned an equal covariance regardless of current weather direction. Ideally, a spatial kernel would more specifically reflect current cloud velocity. Further, the stationary kernel assigns higher weight to closer sites, which may not be optimal as forecast horizons increase. A more refined kernel or adaptive model structure may thus assist in identifying relevant cloud-related data features for transfer learning in a forecast setting.